Abstract

We consider the problem of matrix column subset selection, which selects a subset of
columns from an input matrix such that the input can be well approximated by the span of
the selected columns. Column subset selection has been applied to numerous real-world data
applications such as population genetics summarization, electronic circuits testing and
recommendation systems. In many applications the complete data matrix is unavailable and one needs
to select representative columns by inspecting only a small portion of the input matrix. In this
paper we propose the first provably correct column subset selection algorithms for partially
observed data matrices. The proposed algorithms have different merits and drawbacks in terms
of statistical accuracy, computational efficiency, sample complexity and sampling schemes,
which lets us explore the tradeoffs among these desired properties for column
subset selection. The proposed methods employ the idea of feedback driven sampling and are
inspired by several sampling schemes previously introduced for low-rank matrix approximation
tasks [DMM08, FKV04, DV06, KS14]. Our analysis shows that, under the assumption that the
input data matrix has incoherent rows but possibly coherent columns, all algorithms provably
converge to the best low-rank approximation of the original data as the number of selected columns
increases. Furthermore, two of the proposed algorithms enjoy a relative error bound, which
is preferred for column subset selection and matrix approximation purposes. We also demonstrate
through both theoretical and empirical analysis the power of feedback driven sampling
compared to uniform random sampling on input matrices with highly correlated columns.

Key words. Column subset selection, active learning, leverage scores. 


1 Introduction 

Given a matrix $M \in \mathbb{R}^{n_1 \times n_2}$, the column subset selection problem aims to find $s$ exact columns of
$M$ that capture as much of $M$ as possible. More specifically, we want to select $s$ columns of $M$ to
form a "compressed" matrix $C \in \mathbb{R}^{n_1 \times s}$ that minimizes the norm of the following residue:

$$\min_{X \in \mathbb{R}^{s \times n_2}} \|M - CX\|_\xi = \|M - CC^\dagger M\|_\xi, \qquad (1)$$

where $C^\dagger$ is the Moore-Penrose pseudoinverse of $C$ and $\xi = 2$ or $F$ denotes the spectral or Frobenius
norm. In this paper we mainly focus on the Frobenius norm, as was the case in previous theoretical
analyses of sampling based column subset selection algorithms [DMM08, FKV04, DV06, DRVW06].
To evaluate the performance of column subset selection, one compares the residue norm
defined in Eq. (1) with $\|M - M_k\|_\xi$, where $M_k$ is the best rank-$k$ approximation of $M$. Usually
the number of selected columns $s$ is larger than or equal to the target rank $k$. Two forms of error
guarantee are common: the additive error guarantee in Eq. (2) and the relative error guarantee in Eq. (3),
with $0 < \epsilon < 1$ and $c > 1$ (ideally $c = 1 + \epsilon$).

$$\|M - CC^\dagger M\|_\xi \le \|M - M_k\|_\xi + \epsilon \|M\|_\xi; \qquad (2)$$

$$\|M - CC^\dagger M\|_\xi \le c\,\|M - M_k\|_\xi. \qquad (3)$$

In general, a relative error bound is preferable because $\|M\|_\xi$ is usually large in practice.
In addition, when $M$ is exactly low rank (rank at most $k$), Eq. (3) implies perfect reconstruction, while the
error in Eq. (2) remains non-zero. The column subset selection problem can be considered as a
form of unsupervised feature selection, which arises frequently in the analysis of large datasets. For
example, column subset selection has been applied to various tasks such as summarizing population
genetics, testing electronic circuits and recommendation systems. Interested readers should refer
to [BMD09, BNB10] for further motivations.
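
To make the two error measures concrete, the following is a minimal NumPy sketch that computes the Frobenius-norm selection error of Eq. (1) for a given column subset and compares it with the best rank-$k$ error. The function names, the synthetic matrix and the choice of columns are illustrative assumptions, not part of the paper.

```python
import numpy as np

def selection_error(M, cols):
    """Frobenius selection error ||M - C C^+ M||_F for the chosen column indices."""
    C = M[:, cols]
    # C C^+ M is the projection of M onto span(C); lstsq solves min_X ||M - C X||_F.
    X, *_ = np.linalg.lstsq(C, M, rcond=None)
    return np.linalg.norm(M - C @ X, "fro")

def best_rank_k_error(M, k):
    """||M - M_k||_F, computed from the trailing singular values."""
    sv = np.linalg.svd(M, compute_uv=False)
    return np.sqrt(np.sum(sv[k:] ** 2))

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 40))   # exactly rank 3
M = A + 0.01 * rng.standard_normal(A.shape)                       # small noise

k = 3
cols = [0, 1, 2]                 # an arbitrary (hypothetical) column subset
err_sel = selection_error(M, cols)
err_opt = best_rank_k_error(M, k)
c = err_sel / err_opt            # empirical relative-error factor, as in Eq. (3)
```

On the exactly low-rank matrix `A`, any three linearly independent columns reproduce the matrix, which is the perfect-reconstruction behaviour that a relative error bound implies.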

Many methods have been proposed for the column subset selection problem [Cha87, GE96,
FKV04, DRVW06, DMM08, BDMI14]. An excellent summary of these methods and their
theoretical guarantees is available in Table 1 of [BMD09]. Most of these methods can be roughly
categorized into two classes. One class of algorithms is based on rank-revealing QR (RRQR)
decomposition [Cha87, GE96], and it has been shown in [BMD09] that RRQR is nearly optimal
in terms of residue norm under the $s = k$ setting, that is, when exactly $k$ columns are selected to
reconstruct an input matrix. On the other hand, sampling based methods [FKV04, DRVW06, DMM08]
select columns by sampling from certain distributions over all columns of an input matrix.
Extensions of sampling based methods to general low-rank matrix approximation problems have also
been investigated [CEM+15, BJS15]. These algorithms are much faster than RRQR and achieve
comparable performance if the sampling distribution is carefully selected and slight over-sampling (i.e.,
$s > k$) is allowed [DRVW06, DMM08]. In [BMD09] sampling based and RRQR based algorithms
are unified to arrive at an efficient column subset selection method that uses exactly $s = k$ columns
and is nearly optimal.

Although the column subset selection problem with access to the full input matrix has been
extensively studied, in practice it is often hard or even impossible to obtain the complete data. For
example, for the genetic variation detection problem it could be expensive and time-consuming to
obtain full DNA sequences of an entire population. Several heuristic algorithms have been
proposed recently for column subset selection with missing data, including the Block OMP algorithm
[BNB10] and the group Lasso formulation explored in [BXM10]. Nevertheless, no theoretical
guarantees or error bounds have been derived for these methods. The presence of missing data poses
new challenges for column subset selection, as many well-established algorithms seem incapable
of handling missing data in an elegant way. Below we identify a few key challenges that prevent
the application of previous theoretical results on column subset selection under the missing data setting:

• Coherent matrix design: most previous results on the completion or recovery of low rank
matrices with incomplete data assume the underlying data matrix is incoherent [Rec11, CP10,
KMO10], which intuitively assumes all rows and columns in the data matrix are weakly
correlated. On the other hand, previous algorithms for column subset selection and matrix CUR
decomposition spent most effort on dealing with coherent matrices [DRVW06, DMM08,
BMD09, BW14]. In fact, one can show that under the standard incoherence assumptions of
matrix completion algorithms a high-quality column subset can be obtained by sampling each
column uniformly at random, which trivializes the problem [XJZ15]. This gap in problem
assumptions renders column subset selection on incomplete coherent matrices particularly
difficult. In this paper, we explore the possibility of a weaker incoherence assumption that
bridges the gap. We present and discuss the detailed assumptions considered in this paper in
Section 1.1.

• Limitation of existing sampling schemes: previous matrix completion methods usually
assume the observed data are sampled uniformly at random. However, in [KS14] it is proved
that uniform sampling (in fact any sampling scheme with an a priori fixed sampling distribution)
is not sufficient to complete a coherent matrix. Though in [CBSW13] a provably correct
sampling scheme was proposed for any matrix based on statistical leverage scores, which are
also the key ingredient of many previous column subset selection and matrix CUR decomposition
algorithms [DMM08, BMD09, BW14], it is very difficult to approximate the leverage
scores of an incomplete coherent matrix. Common perturbation results on singular vector
spaces (e.g., Wedin's theorem) fail because closeness between two subspaces does not imply
closeness in their leverage scores, since the latter are defined in an infinity norm manner (see
Section 2.1 for details).

• Limitation of zero filling: A straightforward algorithm for missing data column subset
selection is to first fill all unobserved entries with zero and then properly scale the observed
ones so that the completed matrix is consistent with the underlying data matrix in expectation
[AM07, AKL13]. Column subset selection algorithms designed for fully observed data
could then be applied to the zero-filled matrix. However, the zero filling procedure can
change the underlying subspace of a matrix drastically [BRN10] and usually leads to additive
error bounds as in Eq. (2). To achieve stronger relative error bounds we need algorithms
that go beyond the zero filling idea.

In this paper, we propose three column subset selection algorithms based on the idea of active
sampling of the input matrix. In our algorithms, observed matrix entries are chosen sequentially
and in a feedback-driven manner. We motivate this sampling setting from both practical and
theoretical perspectives. In applications where each entry of a data matrix $M$ represents the result of
an expensive or time-consuming experiment, it makes sense to carefully select which entry to query
(experiment), possibly in a feedback-driven manner, so as to reduce experimental cost. For
example, if $M$ has drugs as its rows and targets (proteins) as its columns, it makes sense to cautiously
select drug-target pairs for sequential experimental study in order to find important drugs/targets
with typical drug-target interactions. From a theoretical perspective, we show in Section 7.1 that no
passive sampling scheme is capable of achieving relative-error column subset selection with high
probability, even if the column space of $M$ is incoherent. Such results suggest that active/adaptive
sampling is to some extent unavoidable, unless both row and column spaces of $M$ are incoherent.

*The precise definition of incoherence is given in Section 1.3 
We also remark that our algorithms make very few measurements of the input
matrix, which differs from previous feedback-driven re-sampling methods in the theoretical computer
science literature (e.g., [WZ13]) that require access to the entire input matrix. Active sampling
has been shown to outperform all passive schemes in several settings [HCN11], and furthermore it
works for completion of matrices with incoherent rows/columns under which passive learning
provably fails [CBSW13, KS13, KS14]. To the best of our knowledge, the algorithms proposed in this
paper are the first column subset selection algorithms for coherent matrices that enjoy theoretical
error guarantees with missing data, whether passive or active. Furthermore, two of our proposed
methods achieve relative error bounds.

1.1 Assumptions 

Completing/approximating partially observed low-rank matrices using a subset of columns requires
certain assumptions on the input data matrix $M$ [CP10, CBSW13, Rec11, XJZ15]. To see this,
consider the extreme-case example where the input data matrix $M$ consists of exactly one non-zero
element (i.e., $M_{ij} = \mathbb{1}\{i = i^*, j = j^*\}$ for some $i^* \in [n_1]$ and $j^* \in [n_2]$). In this case, the
relative approximation quality $c = \|M - CC^\dagger M\|_\xi / \|M - M_1\|_\xi$ in Eq. (3) would be infinity if
column $j^*$ is not selected in $C$. In addition, it is clearly impossible to correctly identify $j^*$ using
$o(n_1 n_2)$ observations even with active sampling strategies. Therefore, additional assumptions on
$M$ are required to provably approximate a partially observed matrix using column subsets.

In this work we consider the assumption that the top-$k$ column space of the input matrix $M$
is incoherent (detailed mathematical definition given in Section 2.1), while placing no incoherence or
spikiness assumptions on the actual columns of $M$. In addition to the necessity of incoherence
assumptions for incomplete matrix approximation problems discussed above, we further motivate
the "one-sided" incoherence assumption from two perspectives:

- Column subset selection with incomplete observation remains a non-trivial problem even if
the column space is assumed to be incoherent. Due to the possible heterogeneity of the
columns, naive methods such as column subsets sampled uniformly at random are in general
bad approximations of the original data matrix $M$. Existing column subset selection
algorithms for fully-observed matrices also need substantial revision to accommodate missing
matrix components.

- Compared to existing work on approximating low-rank incomplete matrices, our assumptions
(one-sided incoherence) are arguably weaker. [XJZ15] analyzed matrix CUR approximation
of partially observed matrices, but assumed that both column and row spaces are incoherent;
[KS14] derived an adaptive sampling procedure to complete a low-rank matrix with only
one-sided incoherence assumptions, but only achieved additive error bounds for noisy low-rank
matrices.

1.2 Our contributions 

The main contribution of this paper is three provably correct algorithms for column subset selection 
via inspecting only a small portion of the input matrix. The sampling schemes for the proposed 
algorithms and their main merits and drawbacks are summarized below: 
1. Norm sampling: The algorithm is simple and works for any input matrix with incoherent
columns. However, it only achieves an additive error bound as in Eq. (2). It is also inferior
to the other two proposed methods in terms of residue error on both synthetic and real-world
datasets.

2. Iterative norm sampling: The iterative norm sampling algorithm enjoys relative error
guarantees as in Eq. (3) at the expense of being much more complicated and computationally
expensive. In addition, its correctness is only proved for low-rank matrices with incoherent
column space corrupted with i.i.d. Gaussian noise.

3. Approximate leverage score sampling: The algorithm enjoys a relative error guarantee for
general input matrices with incoherent column space. However, it requires more over-sampling
and its error bound is worse than the one for iterative norm sampling on noisy low-rank
matrices. Moreover, to actually reconstruct the data matrix the approximate leverage score
sampling scheme requires sampling a subset of both entire rows and columns, while both
norm based algorithms only require sampling of some entire columns.

In summary, our proposed algorithms offer a rich, provably correct toolset for column subset
selection with missing data. Furthermore, analyzing different aspects of the proposed methods
yields a comprehensive understanding of the design tradeoffs among statistical accuracy,
computational efficiency, sample complexity and sampling scheme. Our analysis could provide
further insights into other matrix completion/approximation tasks on partially observed data.

We also perform a comprehensive experimental study of column subset selection with missing
data using the proposed algorithms as well as modifications of heuristic algorithms proposed
recently [BNB10, BXM10], on synthetic matrices and two real-world applications: tagging Single
Nucleotide Polymorphisms (tSNP) selection and column based image compression. Our empirical
study verifies most of our theoretical results and reveals a few interesting observations that were
previously unknown. For instance, though leverage score sampling is widely considered the
state of the art for matrix CUR approximation and column subset selection, our experimental
results show that under certain low-noise regimes (meaning that the input matrix is very close to low
rank) iterative norm sampling is preferable and achieves smaller error. These observations open
new questions and call for new analysis in related fields, even for the fully observed case.


1.3 Notations 

For any matrix $M$ we use $M^{(i)}$ to denote the $i$-th column of $M$. Similarly, $M_{(i)}$ denotes the $i$-th
row of $M$. All norms $\|\cdot\|$ are $\ell_2$ norms or the matrix spectral norm unless otherwise specified.

We assume the input matrix is of size $n_1 \times n_2$, $n = \max(n_1, n_2)$. We further assume that
$n_1 \le n_2$. We use $x_i = M^{(i)} \in \mathbb{R}^{n_1}$ to denote the $i$-th column of $M$. Furthermore, for any column
vector $x_i \in \mathbb{R}^{n_1}$ and index subset $\Omega \subseteq [n_1]$, define the subsampled vector $x_{i,\Omega}$ and the scaled
subsampled vector $\mathcal{R}_\Omega(x_i)$ as

$$x_{i,\Omega} = \mathbb{1}_\Omega \circ x_i, \qquad \mathcal{R}_\Omega(x_i) = \frac{n_1}{|\Omega|}\, \mathbb{1}_\Omega \circ x_i, \qquad (4)$$

where $\mathbb{1}_\Omega \in \{0,1\}^{n_1}$ is the indicator vector of $\Omega$ and $\circ$ is the Hadamard product (entrywise
product). We also generalize the definition in Eq. (4) to matrices by applying the same operator on each
column.

^See Section 1.3 for the distinction between selection and reconstruction.

We use $\|M - CC^\dagger M\|_\xi$ to denote the selection error and $\|M - CX\|_\xi$ to denote the reconstruction
error. The difference between the two types of error is that for selection error an algorithm is
only required to output indices of the selected columns, while for reconstruction error an algorithm
needs to output both the selected columns $C$ and the coefficient matrix $X$ so that $CX$ is close to $M$.
We remark that the reconstruction error always upper bounds the selection error due to Eq. (1). On
the other hand, there is no simple procedure to compute $C^\dagger M$ when $M$ is not fully observed.
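
The relation between the two error types can be spelled out in one line. This worked step only restates Eq. (1); $X'$ is a dummy optimization variable introduced here for illustration.

```latex
\|M - CC^{\dagger}M\|_\xi
  \;=\; \min_{X' \in \mathbb{R}^{s \times n_2}} \|M - CX'\|_\xi
  \;\le\; \|M - CX\|_\xi
  \qquad \text{for every } X \in \mathbb{R}^{s \times n_2}.
```

In particular, any bound on the reconstruction error of an algorithm's output $(C, X)$ is automatically a bound on the selection error of $C$, but not conversely.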

1.4 Outline of the paper 

The paper is organized as follows: in Section 2 we provide background knowledge and review
several concepts that are important to our analysis. We then present the main results of the paper, the three
proposed algorithms and their theoretical guarantees, in Section 3. Proofs of the main results given
in Section 3 are sketched in Section 4, and some technical lemmas and complete proof details are
deferred to the appendix. In Section 5 we briefly describe previously proposed heuristic based
algorithms for column subset selection with missing data and their implementation details. Experimental
results are presented in Section 6, and we discuss several aspects including the limitation of passive
sampling and the time complexity of the proposed algorithms in Section 7.


2 Preliminaries 

This section provides necessary background knowledge for the analysis in this paper. We first review
the concept of coherence, which plays an important role in sampling based matrix algorithms. We
then summarize three matrix sampling schemes proposed in previous literature.


2.1 Subspace and vector incoherence 

Incoherence plays a crucial role in various matrix completion and approximation tasks [Rec11,
KS14, CT10, KMO10]. For any matrix $M \in \mathbb{R}^{n_1 \times n_2}$ of rank $k$, the singular value decomposition yields
$M = U \Sigma V^\top$, where $U \in \mathbb{R}^{n_1 \times k}$ and $V \in \mathbb{R}^{n_2 \times k}$ have orthonormal columns. Let $\mathcal{U} = \mathrm{span}(U)$
and $\mathcal{V} = \mathrm{span}(V)$ be the column and row space of $M$. The column space coherence is defined as

$$\mu(\mathcal{U}) := \frac{n_1}{k} \max_{i=1}^{n_1} \|U^\top e_i\|_2^2 = \frac{n_1}{k} \max_{i=1}^{n_1} \|U_{(i)}\|_2^2. \qquad (5)$$

Note that $\mu(\mathcal{U})$ is always between 1 and $n_1/k$. Similarly, the row space coherence is defined as

$$\mu(\mathcal{V}) := \frac{n_2}{k} \max_{j=1}^{n_2} \|V^\top e_j\|_2^2 = \frac{n_2}{k} \max_{j=1}^{n_2} \|V_{(j)}\|_2^2. \qquad (6)$$

In this paper we also make use of the incoherence level of vectors, which previously appeared in
[BRN10, KS13, KS14]. For a column vector $x \in \mathbb{R}^{n_1}$ its incoherence is defined as

$$\mu(x) := \frac{n_1 \|x\|_\infty^2}{\|x\|_2^2}. \qquad (7)$$

It is an easy observation that if $x$ lies in the subspace $\mathcal{U}$ then $\mu(x) \le k\mu(\mathcal{U})$. In this paper
we adopt incoherence assumptions on the column space $\mathcal{U}$, which subsequently yield incoherent
column vectors $x_i$. No incoherence assumption on the row space $\mathcal{V}$ or row vectors is made.
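
The definitions in Eqs. (5) and (7) can be checked numerically. The snippet below is a small NumPy sketch under the assumption that a basis with orthonormal columns is already available; the function names and the two toy subspaces are hypothetical and chosen only to hit the extreme values $n_1/k$ and 1.

```python
import numpy as np

def subspace_coherence(U):
    """mu(U) = (n/k) * max_i ||U^T e_i||_2^2 for U with orthonormal columns (Eq. (5))."""
    n, k = U.shape
    row_norms_sq = np.sum(U ** 2, axis=1)   # ||U^T e_i||^2 is the squared norm of row i
    return n / k * row_norms_sq.max()

def vector_incoherence(x):
    """mu(x) = n * ||x||_inf^2 / ||x||_2^2 (Eq. (7))."""
    x = np.asarray(x, dtype=float)
    return x.size * np.max(np.abs(x)) ** 2 / np.sum(x ** 2)

n, k = 8, 2
# Most coherent case: the subspace is spanned by standard basis vectors.
U_spiky = np.eye(n)[:, :k]
# Least coherent case: orthonormal columns whose entries all have equal magnitude.
U_flat = np.array([[1, 1], [1, -1]] * (n // 2)) / np.sqrt(n)

mu_spiky = subspace_coherence(U_spiky)   # attains the upper bound n / k
mu_flat = subspace_coherence(U_flat)     # attains the lower bound 1

x = U_flat @ np.array([0.3, -1.2])       # a vector lying in span(U_flat)
mu_x = vector_incoherence(x)             # bounded by k * mu_flat
```

The last two lines illustrate the observation $\mu(x) \le k\mu(\mathcal{U})$ for a vector in the subspace.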

2.2 Matrix sampling schemes 

Norm sampling Norm sampling for column subset selection was proposed in [FKV04] and has
found applications in a number of matrix computation tasks, e.g., approximate matrix multiplication
[DKM06a] and low-rank or compressed matrix approximation [DKM06b, DKM06c]. The idea is to
sample each column with probability proportional to its squared $\ell_2$ norm, i.e., $\Pr[i \in C] \propto \|M^{(i)}\|_2^2$
for $i \in \{1, 2, \cdots, n_2\}$. These types of algorithms usually come with an additive error bound on their
approximation performance.
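
For the fully observed case, norm sampling can be sketched in a few lines of NumPy. The function name `norm_sampling` and the toy matrix are illustrative assumptions; columns are drawn i.i.d. with replacement.

```python
import numpy as np

def norm_sampling(M, s, rng):
    """Draw s column indices i.i.d. with Pr[j] proportional to ||M^(j)||_2^2."""
    col_norms_sq = np.sum(M ** 2, axis=0)
    probs = col_norms_sq / col_norms_sq.sum()
    idx = rng.choice(M.shape[1], size=s, replace=True, p=probs)
    return idx, probs

rng = np.random.default_rng(0)
M = rng.standard_normal((20, 10))
M[:, 3] = 0.0      # a zero column can never be sampled
M[:, 7] *= 10.0    # a heavy column dominates the sampling distribution
idx, probs = norm_sampling(M, s=5, rng=rng)
```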

Volume sampling For volume sampling [DRVW06], a subset of columns $C$ is picked with
probability proportional to the volume of the simplex spanned by columns in $C$. That is, $\Pr[C] \propto
\mathrm{vol}(\Delta(C))$ where $\Delta(C)$ is the simplex spanned by the columns $\{M^{(i)} : i \in C\}$. Computationally
efficient volume sampling algorithms exist [DR10, AGR16]. These methods are based on the
computation of characteristic polynomials of the projected data matrix [DR10] or an MCMC sampling
procedure [AGR16]. Under the partially observed setting, both approaches are difficult to apply.
For the characteristic polynomials approach, one has to estimate the characteristic polynomial and
essentially the least singular value of the target matrix $M$ up to relative error bounds. This is not
possible unless the matrix is very well-conditioned, which violates the setting that $M$ is approximately
low-rank. For the MCMC sampling procedure, it was shown in [AGR16] that $O(kn_2)$ iterations are
needed for the sampling Markov chain to mix. As each sampling iteration requires observing one
entire column, performing $O(kn_2)$ iterations essentially requires observing $O(kn_2)$ columns, i.e.,
the entire matrix $M$. On the other hand, an iterative norm sampling procedure is known to perform
approximate volume sampling and therefore enjoys multiplicative approximation bounds for column
subset selection [DV06]. In this paper we generalize the iterative norm sampling scheme to the
partially observed setting and demonstrate similar multiplicative approximation error guarantees.

Leverage score sampling The leverage score sampling scheme was introduced in [DMM08] to
get relative error bounds for CUR matrix approximation and has later been applied to coherent
matrix completion [CBSW13]. For each row $i \in \{1, \cdots, n_1\}$ and column $j \in \{1, \cdots, n_2\}$ define
$\mu_i := \frac{n_1}{k}\|U^\top e_i\|_2^2$ and $\nu_j := \frac{n_2}{k}\|V^\top e_j\|_2^2$ to be their (unnormalized) leverage scores, where
$U \in \mathbb{R}^{n_1 \times k}$ and $V \in \mathbb{R}^{n_2 \times k}$ are the top-$k$ left and right singular vectors of an input matrix $M$. It
was shown in [DMM08] that if rows and columns are sampled with probability proportional to
their leverage scores then a relative error guarantee is possible for matrix CUR approximation and
column subset selection.
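
As an illustration for the fully observed case, leverage scores can be read off a truncated SVD. This NumPy sketch assumes the scaling by $n_1/k$ and $n_2/k$ used in Eq. (5); the function name and test matrix are hypothetical.

```python
import numpy as np

def leverage_scores(M, k):
    """Unnormalized row and column leverage scores from the top-k singular vectors."""
    n1, n2 = M.shape
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    U_k, V_k = U[:, :k], Vt[:k, :].T
    mu = n1 / k * np.sum(U_k ** 2, axis=1)   # mu_i = (n1/k) * ||U^T e_i||_2^2
    nu = n2 / k * np.sum(V_k ** 2, axis=1)   # nu_j = (n2/k) * ||V^T e_j||_2^2
    return mu, nu

rng = np.random.default_rng(0)
M = rng.standard_normal((12, 3)) @ rng.standard_normal((3, 9))   # exactly rank 3
mu, nu = leverage_scores(M, k=3)
col_probs = nu / nu.sum()    # sampling distribution over columns
```

With this scaling the row scores sum to $n_1$ and the column scores to $n_2$, so dividing by the sum gives a valid sampling distribution.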


3 Column subset selection via active sampling 

In this section we propose three column subset selection algorithms that only observe a small portion
of an input matrix. All algorithms employ the idea of active sampling to handle matrices with
coherent rows. Algorithm 1 achieves an additive reconstruction error guarantee for any matrix.
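Table 1: Summary of theoretical guarantees of the proposed algorithms. $s$ denotes the number of
selected columns and $m$ denotes the expected number of observed matrix entries. Dependency on
the failure probability $\delta$ and other polylogarithmic dependency is omitted.

Algorithm  | Error type | Error bound | $s$ | $m$ | Assumptions
Norm       | $\|M - CC^\dagger M\|_F$ | $\|M - M_k\|_F + \epsilon\|M\|_F$ | $\Omega(k/\epsilon^2)$ | $\Omega(\mu_1 n)$ | $\max_i \mu(x_i) \le \mu_1$
Norm       | $\|M - CX\|_F$ | $\|M - M_k\|_F + 2\epsilon\|M\|_F$ | $\Omega(k/\epsilon^2)$ | $\Omega(k\mu_1 n/\epsilon^4)$ | same as above
Iter. norm | $\|M - CC^\dagger M\|_F$ | $\sqrt{2.5^k (k+1)!}\,\|M - M_k\|_F$ | $k$ | $\Omega(k^2 \mu_0 n)$ | $M = A + R$; $\mu(\mathcal{U}(A)) \le \mu_0$
Iter. norm | $\|M - CC^\dagger M\|_F$ | $\sqrt{1+3\epsilon}\,\|M - M_k\|_F$ | $\Theta(k^2 \log k + k/\epsilon)$ | | same as above
Iter. norm | $\|M - CX\|_F$ | $\sqrt{1+3\epsilon}\,\|M - M_k\|_F$ | $\Theta(k^2 \log k + k/\epsilon)$ | | same as above
Lev. score | $\|M - CC^\dagger M\|_F$ | $3(1+\epsilon)\|M - M_k\|_F$ | $\Omega(k^2/\epsilon^2)$ | | $\mu(\mathcal{U}) \le \mu_0$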


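Algorithm 1 Active norm sampling for column subset selection with missing data

1: Input: size of column subset $s$, expected number of samples per column $m_1$ and $m_2$.

2: Norm estimation: For each column $i$, sample each index in $\Omega_{1,i} \subseteq [n_1]$ i.i.d. from
Bernoulli($m_1/n_1$). Observe $x_{i,\Omega_{1,i}}$ and compute $\hat{c}_i = \frac{n_1}{m_1}\|x_{i,\Omega_{1,i}}\|_2^2$. Define $\hat{f} = \sum_i \hat{c}_i$.

3: Column subset selection: Set $C = 0 \in \mathbb{R}^{n_1 \times s}$.

• For $t \in [s]$: sample $i_t \in [n_2]$ such that $\Pr[i_t = j] = \hat{c}_j/\hat{f}$. Observe $M^{(i_t)}$ in full and set
$C^{(t)} = M^{(i_t)}$.

4: Matrix approximation: Set $\hat{M} = 0 \in \mathbb{R}^{n_1 \times n_2}$.

• For each column $x_i$, sample each index in $\Omega_{2,i} \subseteq [n_1]$ i.i.d. from Bernoulli($m_{2,i}/n_1$),
where $m_{2,i} = m_2 n_2 \hat{c}_i/\hat{f}$; observe $x_{i,\Omega_{2,i}}$.

• Update: $\hat{M} \leftarrow \hat{M} + \mathcal{R}_{\Omega_{2,i}}(x_i)\, e_i^\top$.

5: Output: selected columns $C$ and coefficient matrix $X = C^\dagger \hat{M}$.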


Algorithm 2 achieves a relative-error reconstruction guarantee when the input matrix has certain
structure. Finally, Algorithm 3 achieves a relative-error selection error bound for any general input
matrix at the expense of a slower error rate and more sampled columns. Table 1 summarizes the main
theoretical guarantees for the proposed algorithms.

3.1 $\ell_2$ norm sampling

We first present an active norm sampling algorithm (Algorithm 1) for column subset selection under
the missing data setting. The algorithm is inspired by the norm sampling work for column subset
selection by Frieze et al. [FKV04] and the low-rank matrix approximation work by Krishnamurthy
and Singh [KS14].

The first step of Algorithm 1 is to estimate the $\ell_2$ norm of each column by uniform subsampling.
Afterwards, $s$ columns of $M$ are selected independently with probability proportional to their estimated
squared $\ell_2$ norms. Finally, the algorithm constructs a sparse approximation of the input matrix by sampling
each matrix entry with probability proportional to the square of the corresponding column's norm,
and then a CX approximation is obtained.
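
The three steps just described can be simulated on a fully known matrix as follows. This is only a sketch under stated assumptions: the function name is hypothetical, the per-entry sampling rate is clipped at 1, observed entries are rescaled by the inverse sampling rate rather than by $n_1/|\Omega|$, and the pseudoinverse is applied through a least-squares solve with an explicit rank cutoff.

```python
import numpy as np

def active_norm_sampling(M, s, m1, m2, rng):
    """Simulated active norm sampling; M is only read through sampled entries
    and through the s columns that are observed in full."""
    n1, n2 = M.shape

    # Norm estimation: Bernoulli(m1/n1) subsample of every column.
    mask1 = rng.random((n1, n2)) < m1 / n1
    c_hat = n1 / m1 * np.sum((M * mask1) ** 2, axis=0)
    f_hat = c_hat.sum()

    # Column subset selection: Pr[i_t = j] = c_hat_j / f_hat, columns observed in full.
    idx = rng.choice(n2, size=s, replace=True, p=c_hat / f_hat)
    C = M[:, idx]

    # Matrix approximation: column i receives about m2 * n2 * c_hat_i / f_hat samples.
    p2 = np.minimum(1.0, m2 * n2 * c_hat / f_hat / n1)   # per-entry rate (clipping is an assumption)
    mask2 = rng.random((n1, n2)) < p2
    with np.errstate(divide="ignore", invalid="ignore"):
        M_hat = np.where(mask2, M / p2, 0.0)             # rescaled sparse estimate of M

    # Coefficient matrix X = C^+ M_hat via a rank-truncated least-squares solve.
    X = np.linalg.lstsq(C, M_hat, rcond=1e-10)[0]
    return idx, C, X

rng = np.random.default_rng(0)
M = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 60))   # exactly rank 3
idx, C, X = active_norm_sampling(M, s=20, m1=20, m2=30, rng=rng)
```

Columns with larger estimated norm receive proportionally more samples in the last phase, which is what makes the sparse estimate accurate where most of the mass of $M$ lies.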

When the input matrix $M$ has incoherent columns, the selection error as well as the CX
reconstruction error can be bounded as in Theorem 1.


Theorem 1. Suppose $\max_{i=1}^{n_2} \mu(x_i) \le \mu_1$ for some positive constant $\mu_1$. Let $C$ and $X$ be the output
of Algorithm 1. Denote by $M_k$ the best rank-$k$ approximation of $M$. Fix $\delta = \delta_1 + \delta_2 + \delta_3 > 0$. With
probability at least $1 - \delta$, we have

$$\|M - CC^\dagger M\|_F \le \|M - M_k\|_F + \epsilon\|M\|_F \qquad (8)$$

provided that $s = \Omega(k\epsilon^{-2}/\delta_2)$ and $m_1 = \Omega(\mu_1 \log(n/\delta_1))$. Furthermore, if $m_2 = \Omega(\mu_1 s \log^2(n/\delta_3)/(\delta_2 \epsilon^2))$
then with probability $\ge 1 - \delta$ we have the following bound on the reconstruction error:

$$\|M - CX\|_F \le \|M - M_k\|_F + 2\epsilon\|M\|_F. \qquad (9)$$


As a remark, Theorem 1 shows that one can achieve $\epsilon$ additive reconstruction error using
Algorithm 1 with expected sample complexity (omitting dependency on $\delta$)

$$\Omega\left(\mu_1 n_2 \log(n) + \frac{k \mu_1 n_2 \log^2(n)}{\epsilon^4}\right) = \Omega(k \mu_1 \epsilon^{-4} n \log^2 n).$$
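
The two terms in this sample complexity can be traced back to the two entrywise sampling phases of Algorithm 1. The accounting below is a sketch that assumes the per-column budgets sum to $n_2 m_2$ in the matrix approximation phase and drops the dependence on $\delta$; the $s n_1$ entries of the fully observed columns are dominated by the second term.

```latex
\underbrace{n_2 m_1}_{\text{norm estimation}}
  \;+\; \underbrace{\sum_{i=1}^{n_2} m_{2,i}}_{=\, n_2 m_2}
  \;=\; \Omega(\mu_1 n_2 \log n)
  \;+\; \Omega\!\left(\frac{\mu_1 s\, n_2 \log^2 n}{\epsilon^2}\right),
\qquad
s = \Omega\!\left(\frac{k}{\epsilon^2}\right)
  \;\Longrightarrow\;
  \frac{\mu_1 s\, n_2 \log^2 n}{\epsilon^2}
  = \Omega\!\left(\frac{k \mu_1 n_2 \log^2 n}{\epsilon^4}\right).
```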

3.2 Iterative norm sampling 

In this section we present Algorithm 2, another active sampling algorithm based on the idea of
iterative norm sampling and approximate volume sampling introduced in [DV06]. Though Algorithm
2 is more complicated than Algorithm 1, it achieves a relative error bound on inputs that are noisy
perturbations of some underlying low-rank matrix.

Algorithm 2 employs the idea of iterative norm sampling. That is, after selecting $l$ columns from
$M$, the next column (or next several columns depending on the error type) is sampled according
to column norms of a projected matrix $\mathcal{P}_{\mathcal{C}^\perp}(M)$, where $\mathcal{C}$ is the subspace spanned by currently
selected columns. It can be shown that iterative norm sampling serves as an approximation of
volume sampling, a sampling scheme that is known to have relative error guarantees [DRVW06,
DV06].

Theorem 2 shows that when the input matrix $M$ is the sum of an exact low rank matrix $A$ and a
stochastic noise matrix $R$, then by selecting exactly $k$ columns from $M$ using iterative norm sampling
one can upper bound the selection error $\|M - CC^\dagger M\|_F$ by the best rank-$k$ approximation error
$\|M - M_k\|_F$ within a multiplicative factor that does not depend on the matrix size $n$. Such a relative
error guarantee is much stronger than the additive error bound provided in Theorem 1, as when $M$
is exactly low rank the error is eliminated with high probability. In fact, when the input matrix
$M$ is exactly low rank the first phase of the proposed algorithm (Line 1 to Line 9 in Algorithm 2)
resembles the adaptive sampling algorithm proposed in [KS13, KS14] for matrix and tensor
completion, in the sense that at each iteration all columns falling exactly onto the span of already selected
columns will have zero norm after projection and hence will never be sampled again. However, we
are unable to generalize our algorithm to general full-rank inputs because it is difficult to bound the
incoherence level of projected columns (and hence the projection accuracy itself later on) without a
stochastic noise model. We present a new algorithm with slightly worse error bounds in Section 3.3
which can handle general high-rank inputs.
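Algorithm 2 Active iterative norm sampling for column subset selection for data corrupted by
Gaussian noise

1: Input: target rank $k < \min(n_1, n_2)$, error tolerance parameters $\epsilon$, $\delta$ and expected number of
samples per column $m$.

2: Entrywise sampling: For each column $i$, sample each index in an index set $\Omega_i \subseteq [n_1]$ i.i.d.
from Bernoulli($m/n_1$). Observe $x_{i,\Omega_i}$.

3: Approximate volume sampling: Set $C = \emptyset$, $\mathcal{U} = \emptyset$. Let $U$ be an orthonormal basis of $\mathcal{U}$.

4: for $t = 1, 2, \cdots, k$ do

5: For $i \in \{1, \cdots, n_2\}$, compute $\hat{c}_i^{(t)} = \frac{n_1}{m}\|x_{i,\Omega_i} - U_{\Omega_i}(U_{\Omega_i}^\top U_{\Omega_i})^{-1} U_{\Omega_i}^\top x_{i,\Omega_i}\|_2^2$.

6: Set $\hat{f}^{(t)} = \sum_{i=1}^{n_2} \hat{c}_i^{(t)}$.

7: Select a column $i_t$ at random, with probability $\Pr[i_t = j] = \hat{p}_j^{(t)} = \hat{c}_j^{(t)}/\hat{f}^{(t)}$.

8: Observe $M^{(i_t)}$ in full and update: $C \leftarrow C \cup \{i_t\}$, $\mathcal{U} \leftarrow \mathrm{span}(\mathcal{U}, \{M^{(i_t)}\})$.

9: end for

10: Active norm sampling: set $T = (k+1)\log(k+1)$ and $s_1 = s_2 = \cdots = s_{T-1} = 5k$,
$s_T = 10k/\epsilon\delta$; $S = \emptyset$, $\mathcal{S} = \emptyset$. Suppose $U$ is an orthonormal basis of $\mathrm{span}(\mathcal{U}, \mathcal{S})$.

11: for $t = 1, 2, \cdots, T$ do

12: For $i \in \{1, \cdots, n_2\}$, compute $\hat{c}_i^{(t)} = \frac{n_1}{m}\|x_{i,\Omega_i} - U_{\Omega_i}(U_{\Omega_i}^\top U_{\Omega_i})^{-1} U_{\Omega_i}^\top x_{i,\Omega_i}\|_2^2$.

13: Set $\hat{f}^{(t)} = \sum_{i=1}^{n_2} \hat{c}_i^{(t)}$.

14: Select $s_t$ columns $S_t = (i_1, \cdots, i_{s_t})$ independently at random, with probability $\Pr[j \in S_t] = \hat{q}_j^{(t)} = \hat{c}_j^{(t)}/\hat{f}^{(t)}$.

15: Observe $M^{(S_t)}$ in full and update: $S \leftarrow S \cup S_t$, $\mathcal{S} \leftarrow \mathrm{span}(\mathcal{S}, \{M^{(i)}\}_{i \in S_t})$.

16: end for

17: Matrix approximation: $\hat{M} = \sum_{i=1}^{n_2} U(U_{\Omega_i}^\top U_{\Omega_i})^{-1} U_{\Omega_i}^\top x_{i,\Omega_i} e_i^\top$, where $U \in \mathbb{R}^{n_1 \times (s+k)}$ is an
orthonormal basis of $\mathrm{span}(\mathcal{U}, \mathcal{S})$.

18: Output: selected column subsets $C = (M^{(C(1))}, \cdots, M^{(C(k))}) \in \mathbb{R}^{n_1 \times k}$, $S = (M^{(C)}, M^{(S_1)},
\cdots, M^{(S_T)}) \in \mathbb{R}^{n_1 \times s}$ where $s = k + s_1 + \cdots + s_T$ and $X = S^\dagger \hat{M}$.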
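
The core primitive of Algorithm 2 is the estimate of a column's squared residual norm after projection, computed from observed entries only (Lines 5 and 12). The NumPy sketch below illustrates that single step; the function name and arguments are hypothetical, and the projection is evaluated with a least-squares solve instead of forming $(U_\Omega^\top U_\Omega)^{-1}$ explicitly.

```python
import numpy as np

def projected_norm_estimate(x, omega, U, m):
    """Estimate ||x - P_U x||_2^2 from the entries of x indexed by omega.

    x     : full column (only x[omega] is read)
    omega : 1-D array of observed row indices
    U     : (n1, r) orthonormal basis of the current subspace, or None if it is empty
    m     : expected number of observed entries per column
    """
    n1 = x.shape[0]
    x_obs = x[omega]
    if U is None or U.shape[1] == 0:
        resid = x_obs
    else:
        U_obs = U[omega, :]
        # Least squares on the observed rows gives U_obs (U_obs^T U_obs)^{-1} U_obs^T x_obs.
        coef, *_ = np.linalg.lstsq(U_obs, x_obs, rcond=None)
        resid = x_obs - U_obs @ coef
    return n1 / m * np.sum(resid ** 2)

rng = np.random.default_rng(0)
n1, r, m = 50, 3, 20
U, _ = np.linalg.qr(rng.standard_normal((n1, r)))   # orthonormal basis of a random subspace
omega = rng.choice(n1, size=m, replace=False)       # observed rows (fixed size for simplicity)

x_in = U @ rng.standard_normal(r)                   # lies in span(U): residual is zero
x_out = rng.standard_normal(n1)                     # generic vector

est_in = projected_norm_estimate(x_in, omega, U, m)
est_full = projected_norm_estimate(x_out, np.arange(n1), U, n1)
exact = np.sum((x_out - U @ (U.T @ x_out)) ** 2)
```

A column that already lies in the span of the selected columns gets an estimate of zero and is therefore never sampled again, which is the mechanism discussed for the exactly low-rank case.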


Though Eq. (10) is a relative error bound, the multiplicative factor scales exponentially with
the intrinsic rank $k$, which is not completely satisfactory. As a remedy, we show that by slightly
over-sampling the columns ($\Theta(k^2 \log k + k/\epsilon\delta)$ instead of $k$ columns) the selection error as well as
the CX reconstruction error can be upper bounded by $\|M - M_k\|_F$ within only a $(1 + 3\epsilon)$ factor,
which implies that the error bounds are nearly optimal when the number of selected columns $s$ is
sufficiently large, for example, $s = \Omega(k^2 \log k + k/\epsilon\delta)$.

Theorem 2. Fix $\delta > 0$. Suppose $M = A + R$, where $A$ is a rank-$k$ deterministic matrix with
incoherent column space (i.e., $\mu(\mathcal{U}(A)) \le \mu_0$) and $R$ is a random matrix with i.i.d. zero-mean
Gaussian distributed entries. Suppose $k = O(n_1/\log(n_2/\delta))$. Let $C$, $S$ and $X$ be the output of
Algorithm 2. Then the following holds:

1. If $m = \Omega(k^2 \mu_0 \log^2(n/\delta))$ then with probability $\ge 1 - \delta$

$$\|M - CC^\dagger M\|_F^2 \le \frac{2.5^k (k+1)!}{\delta}\,\|R\|_F^2. \qquad (10)$$

The column subset size is $k$ and the corresponding sample complexity is $\Omega(k^2 \mu_0 n \log^2(n/\delta))$.
Algorithm 3 Approximate leverage score sampling for column subset selection on general input
matrices

1: Input: target rank $k$, size of column subset $s$, expected number of row samples $m$.

2: Leverage score estimation: Set $\mathcal{S} = \emptyset$.

• For each row $i$, with probability $m/n_1$ observe the row $M_{(i)}$ in full and update $\mathcal{S} \leftarrow
\mathrm{span}(\mathcal{S}, \{M_{(i)}\})$.

• Compute the first $k$ right singular vectors of $\mathcal{S}$ (denoted by $S_k \in \mathbb{R}^{n_2 \times k}$) and estimate the
unnormalized row space leverage scores as $\hat{l}_j = \|S_k^\top e_j\|_2^2$, $j \in \{1, 2, \cdots, n_2\}$.

3: Column subset selection: Set $C = \emptyset$.

• For $t \in \{1, 2, \cdots, s\}$ select a column $i_t \in [n_2]$ with probability $\Pr[i_t = j] = \hat{p}_j = \hat{l}_j/k$;
update $C \leftarrow C \cup \{i_t\}$.

4: Output: the selected column indices $C \subseteq \{1, 2, \cdots, n_2\}$ and actual columns $C =
(M^{(C(1))}, \cdots, M^{(C(s))})$.

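2. If $m = \Omega(\epsilon^{-1} s \mu_0 \log^2(n/\delta))$ with $s = \Theta(k^2 \log k + k/\epsilon\delta)$, then with probability $\ge 1 - \delta$

$$\|M - SS^\dagger M\|_F^2 \le \|M - SX\|_F^2 \le (1 + 3\epsilon)\|R\|_F^2. \qquad (11)$$

The column subset size is $\Theta(k^2 \log k + k/\epsilon\delta)$ and the sample complexity is (omitting
dependence on $\delta$)

$$\Omega\left(\frac{k^2 \mu_0 n \log k \log^2(n)}{\epsilon} + \frac{k \mu_0 n \log^2(n)}{\epsilon^2}\right).$$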


3.3 Approximate leverage score sampling 

The third sampling-based column subset selection algorithm for partially observed matrices is
presented in Algorithm 3. The proposed algorithm is based on the leverage score sampling scheme
for matrix CUR approximation introduced in [DMM08]. To compute the sampling distribution
(i.e., leverage scores) from partially observed data, the algorithm subsamples a small number of
rows from the input matrix and uses leverage scores of the row space of the subsampled matrix to
form the sampling distribution. Note that we do not attempt to approximate leverage scores of the
original input matrix directly; instead, we compute leverage scores of another matrix that is a good
approximation of the original data. This technique was also explored in [DMIMW12] to approximate
statistical leverage scores in a fully observed setting. Afterwards, the column sampling distribution is
constructed using the estimated leverage scores and a subset of columns is selected according to
the constructed sampling distribution.
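
As a rough illustration of the sampling distribution described above, the following Python sketch computes row-space leverage scores from a few fully observed rows and draws columns from the resulting distribution. It is a simplification under stated assumptions: all function names are hypothetical, and an orthonormal basis of the entire row span (obtained by Gram-Schmidt) stands in for the top-$k$ right singular vectors, which matches Algorithm 3 only when the sampled rows have rank exactly $k$.

```python
import math
import random


def orthonormal_row_basis(rows, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal vectors spanning the given rows."""
    basis = []
    for r in rows:
        v = list(r)
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > tol:
            basis.append([a / norm for a in v])
    return basis


def leverage_scores(basis, n_cols):
    """Unnormalized score of column j: squared norm of e_j projected on the row space."""
    return [sum(q[j] ** 2 for q in basis) for j in range(n_cols)]


def sample_columns(scores, s, rng):
    """Draw s column indices with replacement, Pr[j] proportional to scores[j]."""
    total = sum(scores)
    probs = [score / total for score in scores]
    return rng.choices(range(len(scores)), weights=probs, k=s), probs


# Two observed rows of a 4-column matrix (illustrative data, not from the paper).
rows = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 1.0, 0.0]]
basis = orthonormal_row_basis(rows)
scores = leverage_scores(basis, 4)
picked, probs = sample_columns(scores, s=3, rng=random.Random(0))
```

The scores sum to the dimension of the row space, so dividing by that dimension (here via `total`) gives a valid distribution; a column that is zero on every observed row receives score zero and is never selected.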

We bound the selection error $\|\mathbf{M}-\mathbf{C}\mathbf{C}^\dagger\mathbf{M}\|_F$ of the approximate leverage score sampling
algorithm in Theorem 3. Note that unlike Theorems 1 and 2, only a selection error bound is provided, since
for deterministic full-rank input matrices it is challenging to approximately compute the projection
of $\mathbf{M}$ onto $\mathrm{span}(\mathbf{C})$ because the projected vector may no longer be incoherent (this is in fact the
reason why Theorem 2 holds only for low-rank matrices perturbed by Gaussian noise). It remains
an open problem to approximately compute $\mathbf{C}^\dagger\mathbf{M}$ given $\mathbf{C}$ with provable guarantee for a general
matrix $\mathbf{M}$ without observing it in full. Eq. (12) shows that Algorithm 3 enjoys a relative error bound on
the selection error. In fact, when the input matrix $\mathbf{M}$ is exactly low rank then Algorithm 3 reduces
to the two-step matrix completion method proposed in [CBSW13] for column incoherent inputs.

Although Theorem 3 shows that Algorithm 3 generalizes the relative selection error bound in
Theorem 2 to general input matrices, it also reveals several drawbacks of the approximate leverage
score sampling algorithm compared to the iterative norm sampling method. First, Algorithm 3
always needs to over-sample columns (at the level of $\Theta(k^2/\epsilon^2)$, which is even more than Algorithm
2 for a $(1+\epsilon)$ reconstruction error bound); in contrast, the iterative norm sampling algorithm only
requires exactly $k$ selected columns to guarantee a relative error bound. In addition, Eq. (12) shows
that the selection error bound is suboptimal even if $s$ is sufficiently large because of the $(3+3\epsilon)$
multiplicative term.

Theorem 3. Suppose $\mathbf{M}$ is an input matrix with incoherent top-$k$ column space (i.e., $\mu(\mathcal{U}_k(\mathbf{M})) \le \mu_0$)
and $C$ is the column indices output by Algorithm 3. If $m = \Omega(\epsilon^{-2}\mu_0 k^2\log(1/\delta))$ and
$s = \Omega(\epsilon^{-2}k^2\log(1/\delta))$ then with probability $\ge 1-\delta$ the following holds:

$$\|\mathbf{M}-\mathbf{C}\mathbf{C}^\dagger\mathbf{M}\|_F \le 3(1+\epsilon)\|\mathbf{M}-\mathbf{M}_k\|_F,$$    (12)

where $\mathbf{C} = (\mathbf{M}^{(C(1))}, \cdots, \mathbf{M}^{(C(s))}) \in \mathbb{R}^{n_1\times s}$ are the selected columns and $\mathbf{M}_k$ is the best rank-$k$
approximation of $\mathbf{M}$.

4 Proofs 

In this section we provide proof sketches of the main results (Theorems 1, 2 and 3). Some technical
lemmas and complete proof details are deferred to Appendices A and B.

4.1 Proof sketch of Theorem 1

The proof of Theorem 1 can be divided into two steps. First, in Lemma 1 we show that (approximate)
column sampling yields an additive error bound for column subset selection. Its proof is very
similar to the one presented in [FKV04] and we defer it to Appendix A. Second, we cite a lemma
from [KS14] to show that with high probability the first pass in Algorithm 1 gives accurate estimates
of column norms of the input matrix $\mathbf{M}$.

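Lemma 1. Provided that $(1-\alpha)\|\mathbf{x}_i\|_2^2 \le \hat{c}_i \le (1+\alpha)\|\mathbf{x}_i\|_2^2$ for $i = 1, 2, \cdots, n_2$, with probability
$\ge 1-\delta$ we have

$$\|\mathbf{M}-\mathcal{P}_C(\mathbf{M})\|_F^2 \le \|\mathbf{M}-\mathbf{M}_k\|_F^2 + \frac{(1+\alpha)k}{(1-\alpha)\delta s}\,\|\mathbf{M}\|_F^2,$$    (13)

where $\mathbf{M}_k$ is the best rank-$k$ approximation of $\mathbf{M}$.

Lemma 2 ([KS14], Lemma 10). Fix $\alpha, \delta \in (0,1)$. Assume $\mu(\mathbf{x}_i) \le \mu_0$ holds for $i = 1, 2, \cdots, n_2$.
For some fixed $i \in \{1, \cdots, n_2\}$, with probability $\ge 1-2\delta$ we have

$$(1-\alpha)\|\mathbf{x}_i\|_2^2 \le \hat{c}_i \le (1+\alpha)\|\mathbf{x}_i\|_2^2$$    (14)

with $\alpha = \sqrt{\frac{2\mu_0}{m_1}\log(1/\delta)} + \frac{2\mu_0}{3m_1}\log(1/\delta)$. Furthermore, if $m_1 = \Omega(\mu_0\log(n_2/\delta))$ with carefully
chosen constants then Eq. (14) holds uniformly for all columns with $\alpha = 0.5$.
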
Combining Lemma 1 and Lemma 2 and setting $s = \Omega(k\epsilon^{-2}/\delta)$ for some target accuracy
threshold $\epsilon$, we have that with probability $1-3\delta$ the selection error bound in Theorem 1 holds.

In order to bound the reconstruction error $\|\mathbf{M}-\mathbf{C}\mathbf{X}\|_F$, we cite another lemma from [KS14]
that analyzes the performance of the second pass of Algorithm 1. At a higher level, Lemma 3 is a
consequence of the matrix Bernstein inequality [Tro12], which asserts that the spectral norm of a matrix
can be preserved by a sum of properly scaled randomly sampled sub-matrices.

Lemma 3 ([KS14], Lemma 9). Provided that $(1-\alpha)\|\mathbf{x}_i\|_2^2 \le \hat{c}_i \le (1+\alpha)\|\mathbf{x}_i\|_2^2$ for
$i = 1, 2, \cdots, n_2$, with probability $\ge 1-\delta$ we have

$$\|\mathbf{M}-\widehat{\mathbf{M}}\|_2 \le \|\mathbf{M}\|_F \cdots$$    (15)

The complete proof of Theorem 1 is deferred to Appendix A.

4.2 Proof sketch of Theorem 2

In this section we give proof sketches of Eq. (10) and Eq. (11) separately.

4.2.1 Proof sketch of $\|\mathbf{M}-\mathbf{C}\mathbf{C}^\dagger\mathbf{M}\|_F$ error bound

We take three steps to prove the $\|\mathbf{M}-\mathbf{C}\mathbf{C}^\dagger\mathbf{M}\|_F$ error bound in Theorem 2. At the first step, we
show that when the input matrix has a low rank plus noise structure then with high probability, for
all small subsets of columns, the spanned subspace has incoherent column space (assuming the
low-rank matrix has incoherent column space) and furthermore, the projections of the other columns onto
the orthogonal complement of the spanned subspace are incoherent, too. Given the incoherence
condition we can easily prove a norm estimation result similar to Lemma 2, which is the second
step. Finally, we note that the approximate iterative norm sampling procedure is an approximation
of volume sampling, a column sampling scheme that is known to yield a relative error bound.

STEP 1 We first prove that when the input matrix $\mathbf{M}$ is a noisy low-rank matrix with incoherent
column space, with high probability a fixed column subset also has incoherent column space.
This is intuitive because the Gaussian perturbation matrix is highly incoherent with overwhelming
probability. A more rigorous statement is shown in Lemma 4.

Lemma 4. Suppose $\mathbf{A}$ has incoherent column space, i.e., $\mu(\mathcal{U}(\mathbf{A})) \le \mu_0$. Fix $C \subseteq [n_2]$ to be any
subset of column indices that has $s$ elements and $\delta > 0$. Let $\mathbf{C} = (\mathbf{M}^{(C(1))}, \cdots, \mathbf{M}^{(C(s))}) \in \mathbb{R}^{n_1\times s}$
be the compressed matrix and $\mathcal{U}(C) = \mathrm{span}(\mathbf{C})$ denote the subspace spanned by the selected
columns. Suppose $\max(s,k) \le n_1/4 - k$ and $\log(4n_2/\delta) \le n_1/64$. Then with probability $\ge 1-\delta$
over the random draw of $\mathbf{R}$ we have

$$\mu(\mathcal{U}(C)) = \frac{n_1}{s}\max_{1\le i\le n_1}\|\mathcal{P}_{\mathcal{U}(C)}\mathbf{e}_i\|_2^2 = O\left(\frac{k\mu_0 + s + \sqrt{s\log(n_1/\delta)} + \log(n_1/\delta)}{s}\right);$$    (16)

furthermore, with probability $\ge 1-\delta$ the following holds:

$$\mu\big(\mathcal{P}_{\mathcal{U}(C)^\perp}(\mathbf{M}^{(i)})\big) = O(k\mu_0 + \log(n_1 n_2/\delta)), \quad \forall i \notin C.$$    (17)

At a higher level, Lemma 4 is a consequence of the Gaussian white noise being highly incoherent,
and the fact that the randomness imposed on each column of the input matrix is independent.
The complete proof can be found in Appendix B.

Given Lemma 4, Corollary 1 holds by taking a union bound over all $\sum_{j=1}^{s}\binom{n_2}{j} = O(s(n_2)^s)$
column subsets that contain no more than $s$ elements. The $2s\log(4n_2/\delta) \le n_1/64$ condition is only
used to ensure that the desired failure probability $\delta$ is not exponentially small. Typically, in practice
the intrinsic dimension $k$ and/or the target column subset size $s$ is much smaller than the ambient
dimension $n_1$.

Corollary 1. Fix $\delta > 0$ and $s \ge k$. Suppose $s \le n_1/8$ and $2s\log(4n_2/\delta) \le n_1/64$. With
probability $\ge 1-\delta$ the following holds: for any subset $C \subseteq [n_2]$ with at most $s$ elements, the
spanned subspace $\mathcal{U}(C)$ satisfies

$$\mu(\mathcal{U}(C)) \le O\big((k+s)|C|^{-1}\mu_0\log(n/\delta)\big);$$    (18)

furthermore,

$$\mu\big(\mathcal{P}_{\mathcal{U}(C)^\perp}(\mathbf{M}^{(i)})\big) = O\big((k+s)\mu_0\log(n/\delta)\big), \quad \forall i \notin C.$$    (19)


STEP 2 In this step, we prove that the norm estimation scheme in Algorithm 2 works when the
incoherence conditions in Eq. (18) and (19) are satisfied. More specifically, we have the following
lemma bounding the norm estimation error:

Lemma 5. Fix $i \in \{1, \cdots, n_2\}$, $t \in \{1, \cdots, k\}$ and $\delta, \delta' > 0$. Suppose Eq. (18) and (19) hold with
probability $\ge 1-\delta$. Let $\mathcal{S}_t$ be the subspace spanned by selected columns at the $t$-th round and let
$\hat{c}_i^{(t)}$ denote the estimated squared norm of the $i$th column. If $m$ satisfies

$$m = \Omega(k\mu_0\log(n/\delta)\log(k/\delta')),$$    (20)

then with probability $\ge 1-\delta-4\delta'$ we have

$$\frac{1}{2}\|[\mathbf{E}_t]^{(i)}\|_2^2 \le \hat{c}_i^{(t)} \le \frac{5}{4}\|[\mathbf{E}_t]^{(i)}\|_2^2.$$    (21)

Here $\mathbf{E}_t = \mathcal{P}_{\mathcal{S}_t^\perp}(\mathbf{M})$ denotes the projected matrix at the $t$-th round.

Lemma 5 is similar to previous results on subspace detection [BRN10] and matrix approximation
[KS14]. The intuition behind Lemma 5 is that one can accurately estimate the $\ell_2$ norm of
a vector by uniformly subsampling entries of the vector, provided that the vector itself is incoherent.
The proof of Lemma 5 is deferred to Appendix B.
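
The intuition behind Lemma 5 can be seen in a small numerical sketch. The snippet below is only an illustration under assumptions that are not taken from the paper: the function name is hypothetical, entries are drawn uniformly without replacement, and the two test vectors are extreme cases chosen for clarity.

```python
import random


def estimate_sq_norm(x, m, rng):
    """Estimate ||x||_2^2 from m entries chosen uniformly without replacement."""
    n = len(x)
    observed = rng.sample(range(n), m)
    return (n / m) * sum(x[i] ** 2 for i in observed)


rng = random.Random(0)
n, m = 100, 10

flat = [1.0] * n                     # perfectly incoherent: mass spread evenly
spike = [1.0] + [0.0] * (n - 1)      # maximally coherent: all mass on one entry

flat_estimate = estimate_sq_norm(flat, m, rng)
spike_estimates = [estimate_sq_norm(spike, m, rng) for _ in range(20)]
```

For the flat vector every subsample gives the exact squared norm, while for the spike vector each estimate is either 0 or $n/m$ and never close to the true value 1. This is the failure mode that the incoherence conditions in Eq. (18) and (19) rule out.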

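Similar to the first step, by taking a union bound over all possible subsets of picked columns
and $n_2 - k$ unpicked columns we can prove a stronger version of Lemma 5, as shown in Corollary 2.

Corollary 2. Fix $\delta, \delta' > 0$. Suppose Eq. (18) and (19) hold with probability $\ge 1-\delta$. If

$$m \ge \Omega(k^2\mu_0\log(n/\delta)\log(n/\delta'))$$    (22)

then with probability $\ge 1-\delta-4\delta'$ the following property holds for any column subset selected by
Algorithm 2:

$$\frac{2}{5}\,\frac{\|[\mathbf{E}_t]^{(i)}\|_2^2}{\|\mathbf{E}_t\|_F^2} \le \hat{p}_i^{(t)} \le \frac{5}{2}\,\frac{\|[\mathbf{E}_t]^{(i)}\|_2^2}{\|\mathbf{E}_t\|_F^2}, \quad \forall i \in [n_2],\ t \in [k],$$    (23)

where $\hat{p}_i^{(t)} = \hat{c}_i^{(t)} / \sum_{j=1}^{n_2}\hat{c}_j^{(t)}$ is the sampling probability of the $i$th column at round $t$.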


STEP 3 To begin with, we define volume sampling distributions:

Definition 1 (volume sampling, [DRVW06]). A distribution $p$ over column subsets of size $k$ is a
volume sampling distribution if

$$p(C) = \frac{\mathrm{vol}(\Delta(C))^2}{\sum_{T:|T|=k}\mathrm{vol}(\Delta(T))^2}, \quad \forall |C| = k.$$    (24)
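
To make Definition 1 concrete, the sketch below enumerates all size-$k$ column subsets of a tiny matrix and evaluates Eq. (24). It rests on assumptions not confirmed by this section: $\Delta(C)$ is taken to be the simplex spanned by the origin and the columns in $C$, whose squared volume equals the Gram determinant divided by $(k!)^2$; the function names are hypothetical; and brute-force enumeration is only feasible for toy inputs.

```python
import itertools


def determinant(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]
    n = len(a)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= factor * a[c][j]
    return det


def volume_sampling_distribution(columns, k):
    """Map each size-k index subset C to vol(Delta(C))^2 / sum_T vol(Delta(T))^2."""
    weights = {}
    for subset in itertools.combinations(range(len(columns)), k):
        gram = [[sum(a * b for a, b in zip(columns[i], columns[j]))
                 for j in subset] for i in subset]
        # Squared simplex volume is det(Gram) / (k!)^2; the constant cancels.
        weights[subset] = max(determinant(gram), 0.0)
    total = sum(weights.values())
    return {subset: w / total for subset, w in weights.items()}


# Columns are stored as vectors; the third column repeats the first.
cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
dist = volume_sampling_distribution(cols, k=2)
```

In this toy input the subset containing both copies of the repeated column spans a degenerate simplex and receives probability zero, which is the behaviour that makes volume sampling robust to highly correlated columns.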


Volume sampling has been shown to achieve a relative error bound for column subset selection,
which is made precise by Theorem 4 cited from [DV06, DRVW06].

Theorem 4 ([DV06], Theorem 4). Fix a matrix $\mathbf{M}$ and let $\mathbf{M}_k$ denote the best rank-$k$ approximation
of $\mathbf{M}$. If the sampling distribution $p$ is a volume sampling distribution defined in Eq. (24) then

$$\mathbb{E}_C\big[\|\mathbf{M}-\mathcal{P}_{\mathcal{V}(C)}(\mathbf{M})\|_F^2\big] \le (k+1)\|\mathbf{M}-\mathbf{M}_k\|_F^2;$$    (25)

furthermore, applying Markov's inequality one can show that with probability $\ge 1-\delta$

$$\|\mathbf{M}-\mathcal{P}_{\mathcal{V}(C)}(\mathbf{M})\|_F^2 \le \frac{k+1}{\delta}\|\mathbf{M}-\mathbf{M}_k\|_F^2.$$    (26)
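
For completeness, the passage from Eq. (25) to Eq. (26) can be spelled out as follows. Write $Z = \|\mathbf{M}-\mathcal{P}_{\mathcal{V}(C)}(\mathbf{M})\|_F^2$, a nonnegative random variable, and assume $\|\mathbf{M}-\mathbf{M}_k\|_F > 0$ (otherwise the bound is trivial):

```latex
\Pr\left[ Z > \frac{k+1}{\delta}\,\|\mathbf{M}-\mathbf{M}_k\|_F^2 \right]
\;\le\; \frac{\mathbb{E}_C[Z]}{\frac{k+1}{\delta}\,\|\mathbf{M}-\mathbf{M}_k\|_F^2}
\;\le\; \frac{(k+1)\,\|\mathbf{M}-\mathbf{M}_k\|_F^2}{\frac{k+1}{\delta}\,\|\mathbf{M}-\mathbf{M}_k\|_F^2}
\;=\; \delta .
```

The first inequality is Markov's inequality and the second uses the expectation bound of Theorem 4.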


In general, exact volume sampling is difficult to employ under partial observation settings, as
we explained in Sec. 2.2. However, in [DV06] it was shown that iterative norm sampling serves
as an approximation of volume sampling and achieves a relative error bound as well. In Lemma 6
we present an extension of this result. Namely, approximate iterative column norm sampling is an
approximation of volume sampling, too. Its proof is very similar to the one presented in [DV06] and
we defer it to Appendix B.
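
The iterative norm sampling idea referred to above can be sketched in a few lines of Python. This is a simplified, fully observed version written for illustration only: the function name is hypothetical, and exact residual norms are used where the partially observed algorithm would substitute norm estimates computed from subsampled entries.

```python
import math
import random


def iterative_norm_sampling(columns, k, rng, tol=1e-12):
    """Pick up to k columns; each round samples proportionally to squared residual norms."""
    residual = [list(c) for c in columns]
    selected = []
    for _ in range(k):
        weights = [sum(v * v for v in r) for r in residual]
        weights = [w if w > tol else 0.0 for w in weights]
        if sum(weights) == 0.0:          # every column is already explained
            break
        i = rng.choices(range(len(columns)), weights=weights, k=1)[0]
        selected.append(i)
        norm = math.sqrt(weights[i])
        q = [v / norm for v in residual[i]]      # new orthonormal direction
        for r in residual:                       # project it out of every column
            dot = sum(a * b for a, b in zip(r, q))
            for t in range(len(r)):
                r[t] -= dot * q[t]
    return selected


# The third column repeats the first (illustrative data).
cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
runs = [iterative_norm_sampling(cols, 2, random.Random(seed)) for seed in range(20)]
```

Because a selected column and all of its copies have zero residual afterwards, no run ever returns both copies of the repeated column. This mirrors the zero-volume subsets of volume sampling and illustrates why the feedback-driven scheme copes with correlated columns.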


Lemma 6. Let $p$ be the volume sampling distribution defined in Eq. (24). Suppose the sampling
distribution $\hat{p}$ of a $k$-round sampling strategy satisfies Eq. (23). Then we have

$$\hat{p}_C \le 2.5^k\, k!\; p_C, \quad \forall |C| = k.$$    (27)

We can now prove the error bound for the selection error $\|\mathbf{M}-\mathbf{C}\mathbf{C}^\dagger\mathbf{M}\|_F$ of Algorithm 2 by
combining Corollaries 1 and 2, Lemma 6 and Theorem 4, with failure probabilities $\delta, \delta'$ set at $O(1/k)$
to facilitate a union bound argument across all iterations. In particular, Corollaries 1 and 2 guarantee
that Algorithm 2 estimates column norms accurately with high probability; then one can apply
Lemma 6 to show that the sampling distribution employed in the algorithm is actually an approximate
volume sampling distribution, which is known to achieve relative error bounds (by Theorem 4).

4.2.2 Proof sketch of $\|\mathbf{M}-\mathbf{S}\mathbf{X}\|_F$ error bound

We first present a theorem, which is a generalization of Theorem 2.1 in [DRVW06].

Theorem 5 ([DRVW06], Theorem 2.1). Suppose $\mathbf{M} \in \mathbb{R}^{n_1\times n_2}$ is the input matrix and $\mathcal{U} \subseteq \mathbb{R}^{n_1}$
is an arbitrary vector space. Let $\mathbf{S} \in \mathbb{R}^{n_1\times s}$ be a random sample of $s$ columns in $\mathbf{M}$ from a
distribution $q$ such that

$$\frac{(1-\alpha)\|\mathbf{E}^{(i)}\|_2^2}{(1+\alpha)\|\mathbf{E}\|_F^2} \le q_i \le \frac{(1+\alpha)\|\mathbf{E}^{(i)}\|_2^2}{(1-\alpha)\|\mathbf{E}\|_F^2}, \quad \forall i \in \{1, \cdots, n_2\},$$    (28)

where $\mathbf{E} = \mathcal{P}_{\mathcal{U}^\perp}(\mathbf{M})$ is the projection of $\mathbf{M}$ onto the orthogonal complement of $\mathcal{U}$. Then

$$\mathbb{E}_{\mathbf{S}}\big[\|\mathbf{M}-\mathcal{P}_{\mathrm{span}(\mathcal{U},\mathbf{S}),k}(\mathbf{M})\|_F^2\big] \le \|\mathbf{M}-\mathbf{M}_k\|_F^2 + \frac{(1+\alpha)k}{(1-\alpha)s}\|\mathbf{E}\|_F^2,$$    (29)

where $\mathbf{M}_k$ denotes the best rank-$k$ approximation of $\mathbf{M}$.

Intuitively speaking, Theorem 5 states that relative estimation of the residues $\mathcal{P}_{\mathcal{U}^\perp}(\mathbf{M})$ would yield
relative estimation of the data matrix $\mathbf{M}$ itself.

In the remainder of the proof we assume $s = \Theta(k^2\log(k) + k/(\epsilon\delta))$ is the number of columns
selected in $\mathbf{S}$ in Algorithm 2. Corollary 1 asserts that with high probability
$\mu(\mathcal{U}(S)) = O(s|S|^{-1}\mu_0\log(n/\delta))$ and $\mu(\mathcal{P}_{\mathcal{U}(S)^\perp}(\mathbf{M}^{(i)})) = O(s\mu_0\log(n/\delta))$ for any subset $S$ with
$|S| \le s$. Subsequently, we can apply Lemma 5 and a union bound over $n_2$ columns and $T$ rounds to
obtain the following proposition:

Proposition 1. Fix $\delta, \delta' > 0$. If $m = \Omega(s\mu_0\log(n/\delta)\log(nT/\delta'))$ then with probability $\ge 1-\delta-\delta'$

$$\frac{2\|[\mathbf{E}_t]^{(i)}\|_2^2}{5\|\mathbf{E}_t\|_F^2} \le \hat{p}_i^{(t)} \le \frac{5\|[\mathbf{E}_t]^{(i)}\|_2^2}{2\|\mathbf{E}_t\|_F^2}, \quad \forall i \in \{1, 2, \cdots, n_2\},\ t \in \{1, 2, \cdots, T\}.$$    (30)

Here $\mathbf{E}_t = \mathbf{M} - \mathcal{P}_{\mathrm{span}(\mathcal{U}\cup S_1\cup\cdots\cup S_{t-1})}(\mathbf{M})$ is the residue at round $t$ of the active norm sampling
procedure.

Note that we do not need to take a union bound over all column subsets because this time
we do not require the sampling distribution of Algorithm 2 to be uniformly close to the true active
norm sampling procedure.

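Consequently, combining Theorem 5 and Proposition 1 we obtain Lemma 7. Its proof is deferred
to Appendix B.

Lemma 7. Fix $\delta, \delta' > 0$. If $m = \Omega(s\mu_0\log^2(n/\delta))$ and $s_1 = \cdots = s_{T-1} = 5k$, $s_T = 10k/(\epsilon\delta')$
then with probability $\ge 1-2\delta-\delta'$

$$\|\mathbf{M}-\mathcal{P}_{\mathcal{U}\cup S_1\cup\cdots\cup S_T}(\mathbf{M})\|_F^2 \le (1+\epsilon/2)\|\mathbf{M}-\mathbf{M}_k\|_F^2 + \frac{\epsilon}{2^T}\|\mathbf{M}-\mathcal{P}_{\mathcal{U}}(\mathbf{M})\|_F^2.$$    (31)

Applying Lemma 7 together with the bound on $\|\mathbf{M}-\mathcal{P}_{\mathcal{U}}(\mathbf{M})\|_F^2$ established above, and noting that
$2^{(k+1)\log(k+1)} = (k+1)^{(k+1)} \ge (k+1)!$, we immediately have Corollary 3.

Corollary 3. Fix $\delta > 0$. Suppose $T = (k+1)\log(k+1)$ and let $m, s_1, \cdots, s_T$ be set as in Lemma 7.
Then with probability $\ge 1-4\delta$ one has

$$\|\mathbf{M}-\mathbf{S}\mathbf{S}^\dagger\mathbf{M}\|_F^2 = \|\mathbf{M}-\mathcal{P}_{\mathcal{U}\cup S_1\cup\cdots\cup S_T}(\mathbf{M})\|_F^2 \le (1+\epsilon)\|\mathbf{M}-\mathbf{M}_k\|_F^2 \le (1+\epsilon)\|\mathbf{R}\|_F^2.$$    (32)

To reconstruct the coefficient matrix $\mathbf{X}$ and to further bound the reconstruction error
$\|\mathbf{M}-\mathbf{S}\mathbf{X}\|_F$, we apply the $\mathbf{U}(\mathbf{U}_\Omega^\top\mathbf{U}_\Omega)^{-1}\mathbf{U}_\Omega^\top$ operator on every column to build a low-rank approximation
$\widehat{\mathbf{M}}$. It was shown in [KS13, BRN10] that this operator recovers all components in the underlying
subspace $\mathcal{U}$ with high probability, and hence achieves a relative error bound for low-rank matrix
approximation. More specifically, we have Lemma 8, which is proved in Appendix B.

Lemma 8. Fix $\delta, \delta'' > 0$ and $\epsilon > 0$. Let $\mathbf{S} \in \mathbb{R}^{n_1\times s}$ and $\mathbf{X} \in \mathbb{R}^{s\times n_2}$ be the output of Algorithm 2.
Suppose Corollary 3 holds with probability $\ge 1-\delta$. If $m$ satisfies

$$m = \Omega(\epsilon^{-1}s\mu_0\log(n/\delta)\log(n/\delta'')),$$    (33)

then with probability $\ge 1-\delta-\delta''$ we have

$$\|\mathbf{M}-\widehat{\mathbf{M}}\|_F^2 \le (1+\epsilon)\|\mathbf{M}-\mathbf{S}\mathbf{S}^\dagger\mathbf{M}\|_F^2.$$    (34)

Note that all columns of $\widehat{\mathbf{M}}$ are in the subspace $\mathcal{U}(S)$. Therefore, $\mathbf{S}\mathbf{X} = \mathbf{S}\mathbf{S}^\dagger\widehat{\mathbf{M}} = \widehat{\mathbf{M}}$. The
proof of Eq. (11) is then completed by noting that $(1+\epsilon)^2 \le 1+3\epsilon$ whenever $\epsilon \le 1$.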
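
The final step can be written out as a short chain. Assuming $0 < \epsilon \le 1$ and that the events of Corollary 3 and Lemma 8 both hold, the bounds in Eq. (34) and Eq. (32) combine as:

```latex
\|\mathbf{M}-\mathbf{S}\mathbf{X}\|_F^2
  = \|\mathbf{M}-\widehat{\mathbf{M}}\|_F^2
  \le (1+\epsilon)\,\|\mathbf{M}-\mathbf{S}\mathbf{S}^\dagger\mathbf{M}\|_F^2
  \le (1+\epsilon)^2\,\|\mathbf{R}\|_F^2
  = (1+2\epsilon+\epsilon^2)\,\|\mathbf{R}\|_F^2
  \le (1+3\epsilon)\,\|\mathbf{R}\|_F^2 ,
```

where the last inequality uses $\epsilon^2 \le \epsilon$ for $\epsilon \le 1$.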


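4.3 Proof of Theorem 3

Before presenting the proof, we first present a theorem cited from [DMM08]. In general, Theorem 6
claims that if columns are selected with probability proportional to their row-space leverage scores
then the resulting column subset is a relative-error approximation of the original input matrix.

Theorem 6 ([DMM08], Theorem 3). Let $\mathbf{M} \in \mathbb{R}^{n_1\times n_2}$ be the input matrix and $k$ be a rank parameter.
Suppose a subset of columns $C = \{i_1, i_2, \cdots, i_s\} \subseteq [n_2]$ is selected such that

$$\Pr[i_t = j] = p_j \ge \frac{\beta\|\mathbf{V}_k^\top\mathbf{e}_j\|_2^2}{k}, \quad \forall t \in \{1, \cdots, s\},\ j \in \{1, \cdots, n_2\}.$$    (35)

Here $\mathbf{V}_k \in \mathbb{R}^{n_2\times k}$ is the matrix of top-$k$ right singular vectors of $\mathbf{M}$. If $s = \Omega(\beta^{-1}\epsilon^{-2}k^2\log(1/\delta))$ then with
probability $\ge 1-\delta$ one has

$$\|\mathbf{M}-\mathbf{C}\mathbf{C}^\dagger\mathbf{M}\|_F \le (1+\epsilon)\|\mathbf{M}-\mathbf{M}_k\|_F.$$    (36)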

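Algorithm 4 A block OMP algorithm for column subset selection with missing data

1: Input: size of column subset $s$, observation mask $\mathbf{W} \in \{0,1\}^{n_1\times n_2}$.

2: Initialization: Set $C = \emptyset$, $\mathcal{C} = \emptyset$, $\mathbf{Y} = \mathbf{W}\circ\mathbf{M}$, $\mathbf{Y}^{(1)} = \mathbf{Y}$.

3: for $t = 1, 2, \cdots, s$ do

4: Compute $\mathbf{D} = \mathbf{Y}^\top(\mathbf{W}\circ\mathbf{Y}^{(t)})$. Let $\{\mathbf{d}_i\}_{i=1}^{n_2}$ be rows of $\mathbf{D}$.

5: Column selection: $i_t = \arg\max_{1\le i\le n_2}\|\mathbf{d}_i\|_2$; update: $C \leftarrow C \cup \{i_t\}$, $\mathcal{C} \leftarrow \mathrm{span}(\mathcal{C}, \{\mathbf{Y}^{(i_t)}\})$.

6: Back projection: $\mathbf{Y}^{(t+1)} = \mathbf{Y}^{(t)} - \mathcal{P}_{\mathcal{C}}(\mathbf{Y}^{(t)})$.

7: end for

8: Output: the selected column indices $C \subseteq \{1, 2, \cdots, n_2\}$.

In the sequel we use $\mathcal{Q}_{\mathcal{S}}(\mathbf{M})$ to denote the matrix formed by projecting each row of $\mathbf{M}$ to a row
subspace $\mathcal{S}$ and $\mathcal{P}_{\mathcal{C}}(\mathbf{M})$ to denote the matrix formed by projecting each column of $\mathbf{M}$ to a column
subspace $\mathcal{C}$. Since $\mathbf{M}$ has incoherent column space, the uniform sampling distribution $p_j = 1/n_1$
satisfies Eq. (35) with $\beta = 1/\mu_0$. Consequently, by Theorem 6 the computed row space $\mathcal{S}$ satisfies

$$\|\mathbf{M}-\mathcal{Q}_{\mathcal{S}}(\mathbf{M})\|_F \le (1+\epsilon)\|\mathbf{M}-\mathbf{M}_k\|_F$$    (37)

with high probability when $m = \Omega(k^2/(\beta\epsilon^2)) = \Omega(\mu_0 k^2/\epsilon^2)$.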
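
The claim that uniform row sampling satisfies Eq. (35) with $\beta = 1/\mu_0$ can be checked directly. The derivation below is a sketch under one reading of the argument: Theorem 6 is applied to $\mathbf{M}^\top$, so that rows of $\mathbf{M}$ play the role of columns and the relevant singular vectors are the top-$k$ left singular vectors $\mathbf{U}_k \in \mathbb{R}^{n_1\times k}$ of $\mathbf{M}$. Using the incoherence definition in the form of Eq. (16), $\mu(\mathcal{U}_k(\mathbf{M})) = \frac{n_1}{k}\max_i\|\mathbf{U}_k^\top\mathbf{e}_i\|_2^2 \le \mu_0$:

```latex
\frac{\beta\,\|\mathbf{U}_k^\top \mathbf{e}_i\|_2^2}{k}
\;\le\; \frac{\beta}{k}\cdot\frac{\mu_0 k}{n_1}
\;=\; \frac{\beta\mu_0}{n_1}
\;=\; \frac{1}{n_1} \;=\; p_i
\qquad \text{for } \beta = \frac{1}{\mu_0},\quad i = 1,\dots,n_1 .
```

Substituting $\beta^{-1} = \mu_0$ into the sample size requirement of Theorem 6 then gives the stated number of row samples, up to the logarithmic factor in $\delta$.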

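Next, note that though we do not know $\mathcal{Q}_{\mathcal{S}}(\mathbf{M})$, we know its row space $\mathcal{S}$. Subsequently, we
can compute the exact leverage scores of $\mathcal{Q}_{\mathcal{S}}(\mathbf{M})$, i.e., $\|\mathbf{S}_k^\top\mathbf{e}_j\|_2^2$ for $j = 1, 2, \cdots, n_2$. With the
computed leverage scores, performing leverage score sampling on $\mathcal{Q}_{\mathcal{S}}(\mathbf{M})$ as in Algorithm 3 and
applying Theorem 6 we obtain

$$\|\mathcal{Q}_{\mathcal{S}}(\mathbf{M})-\mathcal{P}_{\mathcal{C}}(\mathcal{Q}_{\mathcal{S}}(\mathbf{M}))\|_F \le (1+\epsilon)\|\mathcal{Q}_{\mathcal{S}}(\mathbf{M})-[\mathcal{Q}_{\mathcal{S}}(\mathbf{M})]_k\|_F,$$    (38)

where $[\mathcal{Q}_{\mathcal{S}}(\mathbf{M})]_k$ denotes the best rank-$k$ approximation of $\mathcal{Q}_{\mathcal{S}}(\mathbf{M})$. Note that

$$\|\mathcal{Q}_{\mathcal{S}}(\mathbf{M})-[\mathcal{Q}_{\mathcal{S}}(\mathbf{M})]_k\|_F \le \|\mathcal{Q}_{\mathcal{S}}(\mathbf{M})-\mathcal{Q}_{\mathcal{S}}(\mathbf{M}_k)\|_F = \|\mathcal{Q}_{\mathcal{S}}(\mathbf{M}-\mathbf{M}_k)\|_F \le \|\mathbf{M}-\mathbf{M}_k\|_F$$    (39)

because $\mathcal{Q}_{\mathcal{S}}(\mathbf{M}_k)$ has rank at most $k$. Consequently, the selection error $\|\mathbf{M}-\mathcal{P}_{\mathcal{C}}(\mathbf{M})\|_F$ can be
bounded as follows:

$$\begin{aligned}
\|\mathbf{M}-\mathcal{P}_{\mathcal{C}}(\mathbf{M})\|_F
&\le \|\mathbf{M}-\mathcal{Q}_{\mathcal{S}}(\mathbf{M})\|_F + \|\mathcal{Q}_{\mathcal{S}}(\mathbf{M})-\mathcal{P}_{\mathcal{C}}(\mathcal{Q}_{\mathcal{S}}(\mathbf{M}))\|_F + \|\mathcal{P}_{\mathcal{C}}(\mathcal{Q}_{\mathcal{S}}(\mathbf{M}))-\mathcal{P}_{\mathcal{C}}(\mathbf{M})\|_F \\
&\le \|\mathbf{M}-\mathcal{Q}_{\mathcal{S}}(\mathbf{M})\|_F + \|\mathcal{Q}_{\mathcal{S}}(\mathbf{M})-\mathcal{P}_{\mathcal{C}}(\mathcal{Q}_{\mathcal{S}}(\mathbf{M}))\|_F + \|\mathcal{Q}_{\mathcal{S}}(\mathbf{M})-\mathbf{M}\|_F \\
&\le 3(1+\epsilon)\|\mathbf{M}-\mathbf{M}_k\|_F.
\end{aligned}$$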


5 Related work on column subset selection with missing data 

In this section we review two previously proposed algorithms for column subset selection with 
missing data. Both algorithms are heuristic based and no theoretical analysis is available. We also 
remark that both methods employ the passive sampling scheme as observation models. In fact, they 
work for any subset of observed matrix entries. 


5.1 Block orthogonal matching pursuit (Block OMP) 

A block OMP algorithm was proposed in [BNB10] for column subset selection with missing data.
Let $\mathbf{W} \in \{0,1\}^{n_1\times n_2}$ denote the “mask” of observed entries; that is,

$$\mathbf{W}_{ij} = \begin{cases} 1, & \text{if } \mathbf{M}_{ij} \text{ is observed;} \\ 0, & \text{if } \mathbf{M}_{ij} \text{ is not observed.} \end{cases}$$

We also use $\circ$ to denote the Hadamard product (entrywise product) between two matrices of the
same dimension.

The pseudocode is presented in Algorithm 4. Note that Algorithm 4 has a very similar framework
to the iterative norm sampling algorithm: both methods select columns in an iterative
manner and after each column is selected, the contribution of selected columns is removed from
the input matrix by projecting onto the complement of the subspace spanned by selected columns.
Nevertheless, there are some major differences. First, in iterative norm sampling we select columns
according to their residue norms while in block OMP we base such selection on inner products
between the original input matrix and the residue one. In addition, due to its passive sampling
nature Algorithm 4 uses the zero-filled data matrix to approximate the subspace spanned by selected
columns. In contrast, iterative norm sampling computes this subspace exactly by active sampling.
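
The block OMP loop described above can be sketched in plain Python as follows. This is an illustrative reading of Algorithm 4 rather than the reference implementation of [BNB10]: the function names are hypothetical, the selection score is assumed to be the $\ell_2$ norm of each row of $\mathbf{D} = \mathbf{Y}^\top(\mathbf{W}\circ\mathbf{Y}^{(t)})$, and already selected columns are explicitly excluded from later rounds.

```python
import math


def hadamard(W, M):
    """Entrywise product; with a 0/1 mask W this zero-fills unobserved entries."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(W, M)]


def block_omp(M, W, s):
    Y = hadamard(W, M)                        # zero-filled data matrix
    n1, n2 = len(Y), len(Y[0])
    residual = [row[:] for row in Y]          # Y^(t), initialised to Y
    selected = []
    for _ in range(s):
        masked = hadamard(W, residual)
        scores = []
        for i in range(n2):
            # Row i of D: inner products of column i of Y with every
            # column of the masked residual.
            d = [sum(Y[r][i] * masked[r][j] for r in range(n1)) for j in range(n2)]
            scores.append(math.sqrt(sum(v * v for v in d)))
        for i in selected:
            scores[i] = -1.0                  # never pick a column twice
        best = max(range(n2), key=lambda i: scores[i])
        selected.append(best)
        # The residual of the chosen column is its component orthogonal to
        # the previously selected columns; project it out of every column.
        q = [residual[r][best] for r in range(n1)]
        norm = math.sqrt(sum(v * v for v in q))
        if norm > 1e-12:
            q = [v / norm for v in q]
            for j in range(n2):
                dot = sum(q[r] * residual[r][j] for r in range(n1))
                for r in range(n1):
                    residual[r][j] -= dot * q[r]
    return selected


# Fully observed toy example (illustrative data).
M = [[10.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
W = [[1, 1, 1],
     [1, 1, 1]]
chosen = block_omp(M, W, 2)
```

Note that the projection uses the zero-filled matrix throughout, which is exactly the approximation that distinguishes this passive scheme from iterative norm sampling.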


5.2 Group Lasso 

The group Lasso formulation was originally proposed in [BXM10] as a convex optimization
alternative for matrix column subset selection and CUR decomposition for fully-observed matrices. It
was briefly remarked in [BNB10] that group Lasso could be extended to the case when only partial
observations are available. In this paper we make this extension precise by proposing the following
convex optimization problem:

$$\min_{\mathbf{X}\in\mathbb{R}^{n_2\times n_2}} \|\mathbf{W}\circ\mathbf{M}-(\mathbf{W}\circ\mathbf{M})\mathbf{X}\|_F^2 + \lambda\|\mathbf{X}\|_{1,2}, \quad \text{s.t. } \mathrm{diag}(\mathbf{X}) = 0.$$    (40)

Here in Eq. (40) $\mathbf{W} \in \{0,1\}^{n_1\times n_2}$ denotes the mask for observed matrix entries and $\circ$ denotes the
Hadamard (entrywise) matrix product. $\|\mathbf{X}\|_{1,2} = \sum_{i=1}^{n_2}\|\mathbf{X}_{(i)}\|_2$ denotes the 1,2-norm of matrix
$\mathbf{X}$, which is the sum of the $\ell_2$ norms of all rows in $\mathbf{X}$. The nonzero rows in the optimal solution $\mathbf{X}$
correspond to the selected columns.
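
As a small illustration of the regularizer, the following sketch evaluates the 1,2-norm and reads off the selected columns from the nonzero rows of a coefficient matrix. The function names and the toy matrix are hypothetical and chosen only for illustration.

```python
import math


def l12_norm(X):
    """Sum of the Euclidean norms of the rows of X."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in X)


def selected_columns(X, tol=1e-10):
    """Indices of nonzero rows of X, i.e. the columns kept by the group Lasso."""
    return [i for i, row in enumerate(X)
            if math.sqrt(sum(v * v for v in row)) > tol]


# A coefficient matrix with zero diagonal and one all-zero row.
X = [[0.0, 3.0, 4.0],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
norm_value = l12_norm(X)
kept = selected_columns(X)
```

Because the penalty acts on whole rows, increasing $\lambda$ zeroes out entire rows at once, which is what turns the optimization problem into a column selection rule.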

Eq. (40) could be solved using standard convex optimization methods, e.g., proximal gradient
descent [MRS+10]. However, to make Eq. (40) a working column subset selection algorithm one
needs to carefully choose the regularization parameter $\lambda$ so that the resulting optimal solution $\mathbf{X}$
has no more than $s$ nonzero rows. Such selection could be time-consuming and inexact. As a
workaround, we implement the solution path algorithm for group Lasso problems in [YZ14].
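
For readers unfamiliar with proximal methods, the key ingredient of a proximal gradient step for a 1,2-norm penalty is row-wise soft-thresholding. The sketch below shows that standard operator only; it is not the solver of [MRS+10] or the solution path algorithm of [YZ14], the function name is hypothetical, and the constraint $\mathrm{diag}(\mathbf{X}) = 0$ is deliberately ignored for brevity.

```python
import math


def prox_l12(X, threshold):
    """Row-wise soft-thresholding: the proximal operator of threshold * ||X||_{1,2}."""
    out = []
    for row in X:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0.0 else 0.0
        out.append([scale * v for v in row])
    return out


X = [[3.0, 4.0],     # row norm 5.0: shrunk but kept
     [0.3, 0.4]]     # row norm 0.5: below the threshold, zeroed out
shrunk = prox_l12(X, threshold=1.0)
```

In a proximal gradient iteration the threshold would be the step size multiplied by $\lambda$; larger values of $\lambda$ remove more rows, which is why the number of selected columns has to be tuned through $\lambda$.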


5.3 Discussion on theoretical assumptions of block OMP and group Lasso 

We discuss theoretical assumptions required for the block OMP and group Lasso approaches. It should
be noted that for the particular matrix column subset selection problem, neither [BNB10] nor [BXM10]
provides a rigorous theoretical guarantee on the approximation error of the selected column subsets.
However, it is informative to compare to typical assumptions that are used to analyze block OMP and
group Lasso for regression problems in the existing literature [YL06, LPVDGT11]. In most cases,
certain “restricted eigenvalue” conditions on the design matrix $\mathbf{X}$ are required, which roughly correspond to a
“weak correlation” condition among columns of a data matrix. This explains the worse performance
of both methods on data sets that have highly correlated columns (e.g., many repeated columns), as
we show in later sections on experimental results.

Figure 1: Selection error on Gaussian random matrices. Top row: low-rank plus noise inputs,
$s = k = 15$; bottom row: full-rank inputs. The black dashed lines denote the noise-to-signal ratio $\sigma$ in
the first row and $\|\mathbf{M}-\mathbf{M}_k\|_F$ in the second row. $\alpha$ indicates the observation rate (i.e., the number
of observed entries divided by $n_1 n_2$, the total number of matrix entries). All algorithms are run
8 times on each dataset and the median error is reported. We report the median instead of the mean
because the performance of norm and leverage score sampling is quite variable.

6 Experiments 

In this section we report experimental results on both synthetic and real-world datasets for our
proposed column subset selection algorithms as well as other competitive methods. All algorithms
are implemented in Matlab. To make fair comparisons, all input matrices $\mathbf{M}$ are normalized so that
$\|\mathbf{M}\|_F^2 = 1$.

6.1 Synthetic datasets 

We first test the proposed algorithms on synthetic datasets. The input matrix has dimension
$n_1 = n_2 = n = 50$. To generate the synthetic data, we consider two different settings listed below:

1. Random Gaussian matrices: for random Gaussian matrices each entry $\mathbf{M}_{ij}$ is sampled i.i.d.
from a normal distribution $\mathcal{N}(0,1)$. For low rank matrices, we first generate a random
Gaussian matrix $\mathbf{B} \in \mathbb{R}^{n\times k}$ where $k$ is the intrinsic rank and then form the data matrix $\mathbf{M}$
as $\mathbf{M} = \mathbf{B}\mathbf{B}^\top$. I.i.d. Gaussian noise $\mathbf{R}$ with $\mathbf{R}_{ij} \sim \mathcal{N}(0, \sigma^2)$ is then added to the
synthesized low-rank matrix. We remark that data matrices generated in this manner have both
incoherent column and row space with high probability.

2. Matrices with coherent columns: we took a simple procedure to generate matrices with
coherent columns in order to highlight the power of proposed algorithms and baseline methods.
After generating a random Gaussian matrix $\mathbf{M} = \mathbf{B}\mathbf{B}^\top$, we pick a column $\mathbf{x}$ from $\mathbf{M}$
uniformly at random. We then take $\tilde{\mathbf{x}} = 10\mathbf{x}$ and repeat the column 5 times. As a result,
the newly formed data matrix will have 5 identical columns with significantly higher norms
compared to the other columns.

Figure 2: Selection error on matrices with coherent columns. Top row: low-rank plus noise inputs,
$s = k = 15$; bottom row: full-rank inputs. $\alpha$ indicates the observation rate. The black dashed lines
denote the noise-to-signal ratio $\sigma$ in the first row and $\|\mathbf{M}-\mathbf{M}_k\|_F$ in the second row. All algorithms
are run 8 times on each dataset and the median error is reported.
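
The data generation procedure above can be sketched as follows. The paper's experiments were run in Matlab; this Python version is only an illustration, and several details are assumptions: the function names are hypothetical, the repeated column overwrites four other randomly chosen columns so that the matrix stays $n \times n$, and the noise is added before the coherent columns are created.

```python
import math
import random


def gaussian_matrix(rows, cols, rng, sigma=1.0):
    return [[rng.gauss(0.0, sigma) for _ in range(cols)] for _ in range(rows)]


def low_rank_plus_noise(n, k, sigma, rng):
    """M = B B^T + R with B an n x k Gaussian matrix and R_ij ~ N(0, sigma^2)."""
    B = gaussian_matrix(n, k, rng)
    R = gaussian_matrix(n, n, rng, sigma)
    return [[sum(B[i][l] * B[j][l] for l in range(k)) + R[i][j]
             for j in range(n)] for i in range(n)]


def add_coherent_columns(M, rng, scale=10.0, repeats=5):
    """Scale one random column and copy it over repeats - 1 other columns."""
    cols = rng.sample(range(len(M[0])), repeats)   # cols[0] is the source column
    out = [row[:] for row in M]
    for row in out:
        value = scale * row[cols[0]]
        for j in cols:
            row[j] = value
    return out, cols


def normalize_frobenius(M):
    """Rescale M so that its squared Frobenius norm equals one."""
    norm = math.sqrt(sum(v * v for row in M for v in row))
    return [[v / norm for v in row] for row in M]


rng = random.Random(0)
M = low_rank_plus_noise(n=50, k=15, sigma=0.1, rng=rng)
M, repeated = add_coherent_columns(M, rng)
M = normalize_frobenius(M)
```

After these steps the columns indexed by `repeated` are identical and carry most of the matrix energy, which is the regime where plain norm sampling tends to pick the same direction repeatedly.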


In Figure 1 we report the selection error $\|\mathbf{M}-\mathbf{C}\mathbf{C}^\dagger\mathbf{M}\|_F$ of proposed and baseline algorithms
on random Gaussian matrices and in Figure 2 we report the same results on matrices with coherent
columns. Results on both low-rank plus noise and high-rank inputs are reported. For low-rank
matrices, both the intrinsic rank $k$ and the number of selected columns $s$ are set to 15. Each algorithm is
run 8 times on the same input and the median selection error is reported. For norm sampling and
approximate leverage score sampling, we implement two variants: in the sampling with replacement
scheme the algorithm samples each column from a sampling distribution (based on either norm or
leverage score estimation) with replacement, while in the sampling without replacement scheme a
column is never sampled twice. Note that all theoretical results in Section 3 are proved for sampling
with replacement algorithms.
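
The difference between the two variants is easy to state in code. The sketch below is one plausible way to implement them, not necessarily the scheme used in the experiments: the function names are hypothetical, and the without-replacement variant draws columns one at a time while renormalizing over the columns that remain.

```python
import random


def sample_with_replacement(probs, s, rng):
    """Draw s indices independently; the same column may appear several times."""
    return rng.choices(range(len(probs)), weights=probs, k=s)


def sample_without_replacement(probs, s, rng):
    """Draw s distinct indices, renormalizing over the remaining columns each time.

    Requires at least s columns with positive probability.
    """
    remaining = list(range(len(probs)))
    weights = list(probs)
    chosen = []
    for _ in range(s):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return chosen


# A spiky distribution: one column carries most of the probability mass.
probs = [0.9, 0.05, 0.03, 0.02]
with_rep = sample_with_replacement(probs, 3, random.Random(0))
without_rep = sample_without_replacement(probs, 3, random.Random(0))
```

With a spiky distribution the with-replacement sample is likely to contain the dominant column several times, so fewer distinct columns are available for the projection, whereas the without-replacement sample always returns $s$ distinct columns.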

From Figure 1 we observe that all algorithms perform similarly, with the exception of two
sampling with replacement algorithms and iterative norm sampling when both rank and missing
rate are high. For the latter case, we conjecture that the degradation of performance is due to
inaccurate norm estimation of column residues; in fact, the iterative norm sampling only provably
works when the input matrix has a low-rank plus noise structure (see Theorem 2). On the other
hand, when either the target rank or the missing rate is not too high iterative norm sampling works
just as well; it is particularly competitive when the true rank of the input matrix is low (see the top
row of Figure 1).

^We discuss the poor performance of the with-replacement algorithms in Section 7.5

Figure 3: Selection error on matrices with varying number of repeated columns. Both $s$ and $k$ are
set to 15 and the noise-to-signal ratio $\sigma$ is set to 0.1. $\alpha$ indicates the observation rate. All algorithms
are run 8 times on each dataset and the median error is reported.

When the input matrix has coherent columns, as shown in Figure 2, it becomes easier to observe
performance gaps among different algorithms. The block OMP algorithm completely fails in such
cases and the selection error for group Lasso also increases considerably. This is due to the fact that
both algorithms observe matrix entries by sampling uniformly at random and hence could be poorly
informed when the underlying matrix is highly coherent. On the other hand, both leverage score
sampling and iterative norm sampling are more robust to column coherence. The coherence among
columns also makes the separation between norm sampling and volume sampling clearer in Figure 2.
In particular, there is a significant gap between the two sampling with replacement curves and
the norm sampling algorithm degrades to its worst-case additive error bound (see Theorem 1). The
gap between the sampling without replacement curves is smaller since the coherent column is only
repeated 5 times in the design and so an algorithm cannot be “too wrong” if it samples columns
without replacement.

To further investigate how the proposed and baseline algorithms adapt to different levels of
coherence, we report in Figure 3 the selection error on noisy low-rank matrices with varying numbers
of repeated columns. Matrices with more repeated columns have a higher coherence level. We can see
that there is a clear separation of two groups of algorithms: the first group includes norm sampling,
block OMP and group Lasso, whose error increases as the matrix becomes more coherent. Also,
design matrix assumptions (e.g., restricted isometry) are violated for group Lasso. This suggests
that these algorithms only have additive error bounds, or adapt poorly to column coherence of the
underlying data matrix. On the other hand, the selection error of volume sampling and iterative
norm sampling remains stable or slightly decreases. This is consistent with our theoretical results
that both volume sampling and iterative norm sampling enjoy relative error bounds.


6.2 Application to tagging Single Nucleotide Polymorphisms (tSNPs) selection 

We apply our proposed methods on real-world genetic datasets. We consider the tagging Single
Nucleotide Polymorphisms (tSNP) selection task as described in [KC03, PMJ+07]. The task aims
at selecting a small set of SNPs in human genes such that the selected SNPs (called tagging SNPs)
capture the genetic information within a specific genome region. More specifically, given an $n_1 \times n_2$
matrix with each row corresponding to the genome expression for an individual, we want to select
$k$ columns (typically $k \ll n_2$) corresponding to $k$ tagging SNPs that best capture the entire SNP
matrix across different individuals. Matrix column subset selection methods have been successfully
applied to the tSNP selection problem [PMJ+07].

Figure 4: Selection error of sampling based algorithms on the HapMap Phase 2 dataset. $\alpha$ indicates the
observation rate. Top row: top-$k$ PCA captures 95% variance within each SNP window; bottom
row: top-$k$ PCA captures 98% variance within each SNP window.

Table 2: Average SNP window sizes for different $\epsilon$ values and numbers of selected columns per
window.

                   5 COLUMNS   10 COLUMNS   15 COLUMNS   20 COLUMNS   25 COLUMNS
$\epsilon$ = 95%        63.4        248.9        516.3        891.0       1405.7
$\epsilon$ = 98%        18.8         62.1        123.4        203.8        309.7

In this section we demonstrate that our proposed algorithms could achieve the same objective
while allowing many missing entries in the raw data matrix. We also compare the selection error of
the proposed methods under different missing rates and numbers of tSNPs. We did not apply
block OMP and group Lasso because the former cannot handle coherent data matrices and the latter
does not scale well. The dataset we used is the HapMap Phase 2 dataset [IHC05]. For demonstration
purposes, we use gene data for the first chromosome of a joint east Asian population consisting of
Han Chinese in Beijing (CHB) and Japanese in Tokyo (JPT). The data matrix consists of 89 rows
(individuals) and 311,854 columns (SNPs). Each matrix entry has two letters $b_1 b_2$ describing a
specific gene expression for an individual.

We follow fhe same step as described in IIJDMPllll to preprocess the data. We first convert the 
raw data matrix into a numerical matrix M with +1/0/-1 entries as follows: let Bi and B 2 be the 
bases that appear for the jth SNP. Fix an individual i with its gene expression 6162 - If ^ 1^2 = BiBi 
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Target rank (k) = 10 


Target rank (k) = 25 




Figure 5: Sorted column space leverage scores for different e and k settings. For each setting 50 
windows are picked at random and their leverage scores are plotted. Each plotted line is properly 
scaled along the X axis so that they have the same length even though actual window sizes vary. 


then $M_{ij}$ is set to -1; else if $b_1 b_2 = B_2 B_2$ then $M_{ij}$ is set to 1; otherwise $M_{ij}$ is set to 0. We 
further split the SNPs into multiple consecutive “windows” so that within each window w the SVD 
reconstruction error is no larger than ε, with ε set to 5% and 2%. We 
refer the readers to Figure 1 in [JDMP11] for details of the preprocessing steps. Average window 
lengths (i.e., the number of SNPs within each window) are shown in the accompanying table for different k and ε settings. 
After preprocessing, column subset selection algorithms are run on each SNP window and 
the selection error is averaged across all windows, as reported in the accompanying figure. The number of selected 
columns per window (k) ranges from 5 to 25 and the sampling budget α ranges from 10% to 60%. 
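
The encoding rule above can be sketched in a few lines. This is only an illustration of the +1/0/-1 mapping as described in the text; the function names and the toy genotype strings are hypothetical and are not taken from the HapMap tooling or from [JDMP11].

```python
def encode_genotype(genotype, base1, base2):
    """Map a two-letter genotype b1b2 to -1, +1 or 0.

    base1 and base2 play the role of B1 and B2, the two bases
    that appear for the SNP (hypothetical argument names).
    """
    if genotype == base1 + base1:
        return -1
    if genotype == base2 + base2:
        return 1
    return 0  # every other genotype, e.g. B1B2 or B2B1


def encode_snp_column(genotypes, base1, base2):
    """Encode one SNP (one matrix column) across all individuals."""
    return [encode_genotype(g, base1, base2) for g in genotypes]


column = encode_snp_column(["AA", "AG", "GG", "GA"], "A", "G")
print(column)  # [-1, 0, 1, 0]
```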

In the corresponding figure we observe that iterative norm sampling and approximate leverage score sampling 
outperform norm sampling by a large margin. This is because the truncated data matrix within 
each window is very close to an exact low-rank matrix and hence relative error algorithms achieve 
much better performance than additive error ones. In addition, approximate leverage score sampling 
significantly outperforms norm sampling under both the with-replacement and without-replacement 
schemes. This shows that the heterogeneity of human SNPs cannot be captured merely by their 
norms, because the norm is simply the ratio of heterozygous individuals within a population and provides little 
information about the importance of an SNP across the entire chromosome. The spikiness of the leverage score 
distribution is empirically verified in Figure 5. Finally, we remark that sampling without replacement is 
much better than sampling with replacement and should always be preferred in practice. We discuss 
this aspect in Section 7.5. 

6.3 Application to column-based image compression 

In this section we show how active sampling can be applied to column-based image compression 
without observing entire images. Given an image, we first actively subsample a small number of 
pixels from the original image. We then select a subset of columns based on the observed pixels and 
reconstruct the entire image by projecting each column onto the space spanned by the selected column 
subset. 
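
As a rough illustration of the projection step and of the relative selection error $\|M - CC^\dagger M\|_F / \|M\|_F$ reported in this section, the sketch below projects every column of a small, fully observed matrix onto the span of a chosen column subset. It is a minimal standard-library sketch under the assumption that all entries are available; it does not implement the active subsampling itself, and the helper names are hypothetical.

```python
import math


def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def orthonormal_basis(vectors, tol=1e-12):
    """Modified Gram-Schmidt; numerically dependent vectors are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = sum(a * b for a, b in zip(q, w))
            w = [a - coeff * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > tol:
            basis.append([a / norm for a in w])
    return basis


def relative_selection_error(matrix, selected):
    """Compute ||M - C C^+ M||_F / ||M||_F for C = M[:, selected].

    C C^+ is the orthogonal projector onto span(C), so the squared
    residual of a column x is ||x||^2 minus its squared projection.
    """
    basis = orthonormal_basis([column(matrix, j) for j in selected])
    residual_sq = 0.0
    total_sq = 0.0
    for j in range(len(matrix[0])):
        x = column(matrix, j)
        x_sq = sum(a * a for a in x)
        proj_sq = sum(sum(a * b for a, b in zip(q, x)) ** 2 for q in basis)
        residual_sq += max(x_sq - proj_sq, 0.0)
        total_sq += x_sq
    return math.sqrt(residual_sq / total_sq)


# Toy rank-2 matrix: the third column is the sum of the first two.
M = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [0.0, 0.0, 0.0]]
print(relative_selection_error(M, [0, 1]))
print(relative_selection_error(M, [0]))
```

Selecting both independent columns spans the whole column space of the toy matrix, whereas a single column leaves half of the squared Frobenius mass unexplained.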

In Figure 6 we depict the final compressed image as well as intermediate steps (e.g., subsampled 
pixels and selected columns) on the 512 x 512 8-bit grayscale Lena standard test image. We 
also report the mean and standard deviation of selection error across 10 runs under different settings 
(c) Iterative norm sampling. Selection error $\|M - CC^\dagger M\|_F / \|M\|_F = 0.088$. 

(d) Approx. leverage score sampling (without replacement). $\|M - CC^\dagger M\|_F / \|M\|_F = 0.103$. 

Figure 6: Column-based image compression results on the Lena standard test image. Left: actively 
sampled image pixels; middle: the selected columns; right: the reconstructed images. Number of 
selected columns is set to 50 and the pixel subsampling rate α is set to 0.3. 




































































































































































Figure 7: Selection error $\|M - CC^\dagger M\|_F$ for the iterative norm sampling algorithm as a function 
of α (left), α/k (middle) and α/k² (right). Error curves plotted under 4 different rank (k) settings. 


of target column subset sizes in Table 3. 

Table 3 shows that the iterative norm sampling algorithm consistently outperforms norm sampling, 
and so does the leverage score sampling method when the target column subset size is large, 
which implies small oracle error $\|M - M_k\|_F$. To get an intuitive sense of why this is the case, 
we refer the readers to the selected columns for each of the sampling algorithms as shown in Figure 6 
(the middle column). It can be seen that the norm sampling algorithm (Figure 6) oversamples 
columns in relatively easy regions (e.g., the white bar on the left side and the smooth part of the 
face) because these regions have large pixel values (i.e., they are whiter than the other pixels) and 
hence have larger column norms. In contrast, the iterative norm sampling algorithm (Figure 6(c)) 
focuses most sampled columns on the tassel and hair parts, which are complicated and cannot be well 
approximated by other columns. This shows that the iterative norm sampling method has the power 
to adapt to highly heterogeneous columns and produce better approximations. Finally, we remark 
that though both leverage score sampling and iterative norm sampling have relative error guarantees, 
in practice iterative norm sampling performs much better than leverage score sampling for 
matrices whose rank is not very high. 
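
The oversampling effect described above can be reproduced on a toy example. The sketch below assumes that norm sampling draws each column with probability proportional to its squared Euclidean norm and that all entries are observed (the algorithms in this paper work with estimated norms instead); the pixel values and the function name are made up for illustration.

```python
def norm_sampling_probabilities(matrix):
    """Column probabilities proportional to squared l2 column norms."""
    n_cols = len(matrix[0])
    sq_norms = [sum(row[j] ** 2 for row in matrix) for j in range(n_cols)]
    total = sum(sq_norms)
    return [s / total for s in sq_norms]


# Columns 0 and 1 mimic a bright, constant (easy) region;
# columns 2 and 3 mimic a darker, textured (hard) region.
image = [
    [250, 250, 10, 90],
    [250, 250, 80, 20],
    [250, 250, 30, 60],
]
probs = norm_sampling_probabilities(image)
for j, p in enumerate(probs):
    print(f"column {j}: probability {p:.3f}")
```

The two bright columns are identical, so selecting both adds no information, yet together they absorb most of the sampling probability.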

7 Discussion 

We discuss several aspects of the proposed algorithms and their analysis. 

7.1 Limitation of passive sampling 

In most cases the observed entries of a partially observable matrix are sampled according to some 
sampling scheme. We say a sampling scheme is passive when the sampling distribution (i.e., the 
probability of observing a particular matrix entry) is fixed a priori and does not depend on the data 
matrix. On the other hand, an active sampling scheme adapts its sampling distribution according 
to previous observations and requests unknown data points in a feedback driven way. We mainly 
focus on active sampling methods in this paper (both Algorithm 1 and Algorithm 2 perform active sampling). 
However, Algorithm 3 only requires passive sampling because the sampling distribution of rows is 
the uniform distribution and is fixed a priori. 

Passive sampling is known to work poorly for coherent matrices [KS14, CBSW13]. In this 
section, we make the following three remarks on the power of passive sampling for column subset 
selection: 

























Table 3: Relative selection error $\|M - CC^\dagger M\|_F / \|M\|_F$ on the standard Lena test image (512 x 
512) for norm sampling (Norm), iterative norm sampling (Iter. norm) and approximate leverage 
score sampling (Lev. score). Results are also compared to a uniform sampling baseline (Uniform) 
and the truncated SVD lower bound (SVD). The percentage of observed entries is set to α = 30%. 
The number of columns used for reconstruction varies from 25 to 100. 

               Uniform        Norm           Iter. norm     Lev. score     SVD
25 columns     .151 ± .009    .147 ± .004    .136 ± .004    .148 ± .007    .092
50 columns     .104 ± .004    .103 ± .003    .092 ± .001    .105 ± .003    .059
100 columns    .064 ± .002    .065 ± .001    .053 ± .001    .061 ± .002    .032


Remark 1 The $\|M - CX\|_F$ reconstruction error bound for column subset selection is hard for 
passive sampling. In particular, it can be shown that no passive sampling algorithm achieves a relative 
reconstruction error bound with high probability unless it observes $\Omega(n_1 n_2)$ entries of an $n_1 \times n_2$ 
matrix M. This holds true even if M is assumed to be exactly low rank and to have an incoherent column 
space. 

This remark can be formalized by noting that when M is exactly low rank, relative reconstruction 
error implies exact recovery of M, or in other words, matrix completion. Here we cite 
the hardness result in [KS14] for completing coherent matrices by passive sampling. Similar results 
could also be obtained by applying Theorem 6 in [CBSW13]. 

Theorem 7 (Theorem 2, [KS14]). Let $\mathcal{X}$ denote all $n_1 \times n_2$ matrices whose rank is no more than 
k and whose column space has incoherence $\mu_0$ as previously defined. Fix $m < n_1 n_2$ and let $\mathcal{Q}$ denote all 
passive sampling distributions over m samples of $n_1 n_2$ matrix entries. Let $\mathcal{F} = \{f : \mathbb{R}^m \to \mathcal{X}\}$ be 
the collection of (possibly random) matrix completion algorithms. We then have 


Kac ■= inf inf sup Pr [/(H, X^) 7 ^ X] > 


1 

2 


m 


1 

2(n — k) ’ 


(41) 


where $n = \max(n_1, n_2)$. As a remark, when $\mu_0$ is a constant the failure probability is $\Omega(1)$ whenever 
$m = o(n_1(n_2 - k))$. 


Remark 2 For the $\|M - CC^\dagger M\|_F$ selection error (with only column indices C output by a 
CSS algorithm), it is possible for a passive sampling algorithm to achieve a relative error bound 
with high probability. In fact, Algorithm 3 and its accompanying theorem accomplish precisely this. In addition, 
when the input matrix is exactly low rank, the same theorem implies that there exists a passive sampling 
algorithm that outputs a small subset of columns which span the entire column subspace of a 
row-coherent matrix with high probability. This result shows that column subset selection is easier than 
matrix completion when only the indices of the selected column subset are required. It does not violate 
Theorem 7, however, because knowing which columns span the column space of an input matrix 
does not imply that we can complete the matrix without further samples. 


Remark 3 Although Remark 2 and its supporting theorem show that it is possible to achieve a relative 
$\|M - CC^\dagger M\|_F$ error bound for row-coherent matrices via passive sampling, we show in this section that 
passive sampling is insufficient under a slightly weaker notion of column incoherence. In particular, 
instead of assuming $\mu(\mathcal{U}) \le \mu_0$ on the column space, we assume $\mu(x_i) \le \mu_1$ for 
every column $x_i$. Note that if $\mathrm{rank}(\mathcal{U}) = k$ and $x_i \in \mathcal{U}$ then $\mu(x_i) \le k\mu(\mathcal{U})$. So 
for exact low-rank matrices the vector-based incoherence assumption is weaker than the 
subspace-based incoherence assumption. We then have the following theorem, which is 
proved in Appendix C. 

Theorem 8. Let X' denote all ni x n 2 matrices whose rank is no more than k and incoherence 
/xi > 1 -|- as defined in Eq. (|^/or each column. Fix m < nin 2 and let Q denote all passive 
sampling distributions over m samples of nin 2 matrix entries. Let X' = {f : M™' —)• [n. 2 ]^} be the 
collection of (possibly random) column subset selection algorithms. We then have 


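$$R^*_{\mathrm{css}} := \inf_{f \in \mathcal{F}'} \inf_{q \in \mathcal{Q}} \sup_{X \in \mathcal{X}'} \Pr\left[X \ne X_C X_C^\dagger X\right] \ge \frac{1}{2} - \frac{m}{2 n_1 (n_2 - k)}, \qquad (42)$$

where $C = f(\Omega, X_\Omega)$ is the output column subset of f. As a remark, the failure probability 
satisfies $R^*_{\mathrm{css}} = \Omega(1)$ whenever $m = o(n_1(n_2 - k))$. 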

Theorem combined with Theorem shows a separation of hardness between column subset 
selection and matrix completion. It also formalizes the intuitive limited power of passive sampling 
over coherent matrices. 
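
To get a feel for the size of the lower bound in Theorem 8, the sketch below evaluates $\frac{1}{2} - \frac{m}{2 n_1 (n_2 - k)}$, which is the form of Eq. (42) implied by Eqs. (61) and (62) in Appendix C. The function name is hypothetical, the matrix dimensions are arbitrary, and clipping the bound at zero is an addition made here because a negative probability bound carries no information.

```python
def passive_css_failure_lower_bound(m, n1, n2, k):
    """Lower bound on the failure probability of any passive CSS algorithm.

    m  : number of passively sampled entries
    n1 : number of rows, n2 : number of columns, k : rank bound
    The value is clipped at 0 (an addition of this sketch).
    """
    if not 0 <= k < n2:
        raise ValueError("need 0 <= k < n2")
    return max(0.0, 0.5 - m / (2.0 * n1 * (n2 - k)))


n1, n2, k = 100, 1000, 10
for fraction in (0.01, 0.1, 0.5, 1.0):
    m = int(fraction * n1 * (n2 - k))
    bound = passive_css_failure_lower_bound(m, n1, n2, k)
    print(f"m = {m:6d}: failure probability >= {bound:.3f}")
```

The bound stays close to 1/2 until m becomes a constant fraction of $n_1(n_2 - k)$, which matches the remark that the failure probability is $\Omega(1)$ whenever $m = o(n_1(n_2 - k))$.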


7.2 Time complexity 

In this section we report the theoretical time complexity of our proposed algorithms as well as the 
optimization based methods for comparison in Table 4. We assume the input matrix M is square 
n x n and we are using s columns to approximate the top-k component of M. Let α = m/n² be the 
percentage of observed data. svd(a, b, c) denotes the time for computing the top-c truncated SVD 
of an a x b matrix. 

Suppose the observation ratio a is a constant and the svd operation takes quadratic time. Then 
the time complexity for all algorithms can be sorted as 


Norm; 0(n"') < Eev. score; 0(A:n"') < Iter, norm, Block OMP; 0(sn'^) < gEasso, 0(r(n'^-hs"'n"')). 

(43) 


Perhaps not surprisingly, in Sections 6.2 and 6.3 on real-world data sets we show the reverse holds 
for selection error for the first three algorithms in Eq. (43). 


Table 4: Time complexity of proposed and baseline algorithms. k denotes the intrinsic rank and s 
denotes the number of selected columns. Dependency on failure probability δ and other 
polylogarithmic dependency is omitted. 


Algorithm 

Norm 

Iter, norm* 

Lev. score 

Block OMP* 

GLASSot 

Time Complexity 

O(an^) 

0(a^srfi) 

0(svd(Q!n, n, k)) 

O(a^sn^) 

0{T(rc‘ + s^v?)) 


* Assume an > s and a^n > 1. 

^Using solution path implementation; T is the desired number of A values. 
7.3 Sample complexity, column subset size and selection error 

We remark on the connection between sample complexity (i.e., number of observed matrix entries), size of 
column subsets and reconstruction error for column subset selection. For column subset selection, 
when the target column subset size is fixed the sample complexity acts more like a threshold: if 
not enough matrix entries are observed then the algorithm fails since the column norms 
are not accurately estimated, but when a sufficient number of observations are available the 
reconstruction error does not differ much. Such a phase transition was also observed in other matrix 
completion/approximation tasks, for example in [KS14]. In fact, the guarantee obtained for norm sampling, 
for example, is exactly the same as in [FKV04] under the fully observed setting, i.e., $m_1 = n_1$. 

The bottom three plots in the corresponding figure are an excellent illustration of this phenomenon. When 
α = 0.3 the selection error of the algorithm is very high, which means the algorithm does not have 
enough samples. However, for α = 0.6 and α = 0.9 the performance of the algorithm is very similar. 

7.4 Sample complexity of the iterative norm sampling algorithm 

We try to verify the sample complexity dependence on the intrinsic matrix rank k for the iterative 
norm sampling algorithm (Algorithm 2). To do this, we run Algorithm 2 under various settings of 
intrinsic dimension k and the sampling probability α (which is basically proportional to the expected 
number of per-column samples m). We then plot the selection error $\|M - CC^\dagger M\|_F$ against α, 
α/k and α/k² in Figure 7. 

The theoretical analysis states that the dependence of m on k should be $m = O(k^2)$ ignoring logarithmic 
factors. However, in Figure 7 one can observe that when the selection error is plotted against α/k 
the different curves coincide. This suggests that the actual dependence of m on k should be close to 
linear instead of quadratic. It is an interesting question whether we can get rid of the use of union 
bounds over all $n_2$-choose-k column subsets in the proof in order to get a near linear 
dependence on k. Note that the curves converge to different values for different k settings because 
selection error decreases when more columns are used to reconstruct the input matrix. 

7.5 Sampling with and without replacement 

In the experiments we observe that for norm sampling (Algorithm 1) and approximate leverage 
score sampling (Algorithm 3) the two column sampling schemes, i.e., sampling with and without 
replacement, make a big difference in practice (e.g., see the figures in Section 6). In fact, sampling 
without replacement always outperforms sampling with replacement because under the latter scheme 
there is a positive probability of sampling the same column more than once. Though we analyzed 
both algorithms under the sampling with replacement scheme, in practice sampling without 
replacement should always be used since it makes no sense to select a column more than once. Finally, 
we remark that for iterative norm sampling (Algorithm 2) a column will never be picked more than 
once since the (estimated) projected norm of an already selected column is zero with probability 1. 
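
The difference between the two schemes can be sketched as follows. The without-replacement variant shown here draws columns one at a time and renormalizes the remaining weights after each draw; this is one common way to implement it and is an assumption of this sketch, not necessarily the exact procedure used in the experiments. The weights and function names are hypothetical.

```python
import random


def sample_with_replacement(weights, s, rng):
    """Draw s column indices i.i.d. with probability proportional to weights."""
    return rng.choices(range(len(weights)), weights=weights, k=s)


def sample_without_replacement(weights, s, rng):
    """Draw s distinct column indices sequentially.

    After each draw the chosen column is removed and the remaining
    weights are implicitly renormalized. Requires at least s columns
    with positive weight.
    """
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(s):
        current = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=current, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


rng = random.Random(0)
weights = [0.90, 0.05, 0.03, 0.02]  # one dominant column
print("with replacement:   ", sample_with_replacement(weights, 3, rng))
print("without replacement:", sample_without_replacement(weights, 3, rng))
```

With a skewed distribution such as the one above, draws with replacement tend to repeat the dominant column, so fewer distinct columns end up in the selected subset for the same budget s.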
Appendix A Analysis of the active norm sampling algorithm 


Proof of Lemma^ This lemma is a direct corollary of Theorem 2 from iFKVOll. First, let Pi = 
Cij f he. the probability of selecting the i-th column of M. By assumption, we have 
Applying Theorem 2from IIFKV04I we have that with probability at least 1 — 5, there exists an 
orthonormal set of vectors • • • , G in span(C) such that 

2 

< ||M-Mfc|||+ ||M|||. (44) 

(1 — a)os 
F 

Finally, to complete the proof, note that every column of M can be represented 

as a linear combination of columns in C; furthermore. 


M- 


yiPyUf 


M 




min ||M- CXili. < 

XglRfcXna 





(45) 

□ 


Proof of Theorem^I^ First, set mi = n(//o log(n2/5i)) we have that with probability > 1 — 5i the 
inequality 

(1 - a)||£Ci||2 < Cj < (1 + a:)||a3i||2 

holds with a = 0.5 for every column i, using Lemma|^ Next, putting s > 6kj52e^ and applying 
Lemma [T] we get 

||M - rc{M)\\F < ||M - Mfcll^ + e||M||ir (46) 

with probability at least 1 — 52- Finally, note that when a < 1/2 and ni < n 2 the bound in Lemma 
[^is dominated by 

||M-M||2 < ||M||j.-0 ■ (47) 


Consequently, for any e' > 0 if m 2 
>1 — 53 


= 12((e') “^yolog^fni + 0 . 2 )/S 3 ) we have with probability 
11XI — ]V[||2 ^ e^||]V[||i7’. (48) 


The proof is then completed by taking e' = el\fs\ 


M-CX||f = 
< 
< 
< 
< 


M-)Pc(M)||i. 

M - Vc{M)\\f + \\Vc{M - M)||i. 

M - Mk\\F + e||M||i. + V^WVciM - M)||2 
XI — Xlfclli? + ellXIIIi? + \/~s ■ 11XI 111? 

XI — XIi-lli? + 2e||XI||i?. 


□ 

"'The original theorem concerns random samples of rows; it is essentially the same for random samples of columns. 



















Appendix B Analysis of the iterative norm sampling algorithm 


Proof of Lemma^ We first prove Eq. (16l. Observe that dim (^(C)) < s.LetRc = G 

j^mxs denote the selected s columns in the noise matrix R and let TZ{C) = span(Rc) denote the 
span of selected columns in R. By definition, ZY(C') C UyjTZ{C), where U = span(A) denotes the 
subspace spanned by columns in the deterministic matrix A. Consequently, we have the following 
bound on ||'P 2 ^(c)ei|| (assuming each entry in R follows a zero-mean Gaussian distribution with cj^ 
variance): 






•I II 2 


< 


< 


(RcR-c) 


— 11|2 IIidT 
2 llxV/^C. 


■C'^*ll 2 


^ + l|Rcll^l 


R (7 and Lemma 


12 


For the last inequality we apply Lemma 14 to bound the largest and smallest singular values of 

|2 


to bound 


|Rjej|| 2 , because R^e,; follow i.i.d. Gaussian distributions with 




covariance cr^Isxs- If e is set as e = \/2 log(4/(5) then the last inequality holds with probability at 
least 1 — 5. Furthermore, when s < ni/2 and 5 is not exponentially small (e.g., y^ 2 log(4/5) < 

the fraction is approximately 0(l/ni). As a result, with probability 1 — ni5 

the following holds: 


^(ZY(C)) = — max 11^^(06*111 


s \ ni 


s -b 


V^slog(l/5) -Hog(l/5) \ 

ni / 


^ ^ fc/XQ + g -b log(l/5) -b log(l/5) ^ 

(49) 


Finally, putting 5' = ni/5 we prove Eq. ( [T^ . 

Next we try to prove Eq. ([n]). Let X be the i-th column of M and write x = a + r, where 
a = Vu{x) and r = (x). Since the deterministic component of x lives in U and the random 

component of a; is a vector with each entry sampled from i.i.d. zero-mean Gaussian distributions, 
we know that r is also a zero-mean random Gaussian vector with i.i.d. sampled entries. Note that 
Pl{C) does not depend on the randomness over : i f. Qf Therefore, in the following analysis 

we will assume ff(C') to be a fixed subspace U with dimension at most g. 

The projected vector x' = x can be written as s = d-bf, where d = P^^ a and r = 

By definition, d lives in the subspace lA n U^. So it satisfies the incoherence assumption 

II ~ ||2 

/x(d) = ^ < kjjiiU) < k^Q. (50) 

On the other hand, because r is an orthogonal projection of some random Gaussian variable, f is 
still a Gaussian random vector, which lives in CiU-^ with rank at least ni — k — s. Subsequently, 
we have 




lIttiP IlhiP 

< 3niV^ + 3ni" 

W rtW^ 


I|a|l2 


fl|2 
' II2 


< 3/c/io + 


Gcr^ni log(2nin2/(5) 


a‘^{ni — k — s) — (ni — k — s) \og{n 2 /S) 


For the second inequality we use the fact that < Yli ^ whenever ai,bi > 0. For the last 
inequality we use Lemma [13] on the enumerator and Lemma [1^ on the denominator. Finally, note 
that when max(s, k) < ni/4 and log(n2/<5) < ni/64 the denominator can be lower bounded by 
(T^ni/4; subsequently, we can bound fi{x) as 


^{x) < 3/c^o + 


24(T^ni \og{2nin2/6) 


cj^ni 


< 3/c^o + 24 log(2ni 77 . 2 / 5 ). 


Taking a union bound over all n 2 — s columns yields the result. 


(51) 


□ 


To prove the norm estimation consistency result in Lemma we first cite a seminal theorem 
from [KS14] which provides a tight error bound on a subsampled projected vector in terms of the 
norm of the true projected vector. 


Theorem 9. Let U be a k-dimensional subspace ofW^ and y = x + v, where x and v G U-^. 
Fix 5' > 0, m > max.{^kp{U) log (|^) , 4/7 (d) log(l/5')} and let Ll be an index set with entries 
sampled uniformly with replacement with probability m/n. Then with probability at least 1 — 45'.’ 


m(l - a)- kp{U)^ 


m , 


n 


■*^112 ^ WUn ~ ^ (l + “)~ll'^ll2! 


(52) 


where a = 2^ log(l/5')+2^ log(l/5'). P = {l+2y/\og{l/5')f andy = 


We are now ready to prove Lemma 

Proof of Lemma^ By Algorithm]^ we know that dim(5t) = t with probability 1. Let y = 
denote the i-th column of M and let v = VstV be the projected vector. We can apply Theorem|^to 
bound the estimation error between ||r)|| and \\yQ — 'Pst{o)yQ\\- 


First, when m is set as in Eq. (201 it is clear that the conditions m > ^tp{L{) log (|^) = 
Ll{kyolog{n/S)log{k/6')) and r?7 > Ap{v)\og{l/6') = Vl{kyQ\og{n/6)\og{l/5')) are satisfied. 
We next turn to the analysis of a, /3 and 7. More specifically, we want a = 0(1), 7 = 0(1) and 

^/3 = 0(1)- 

Fora, a = 0(1) implies 777 = f2(|u(r7) log(l/5')) = Vl{kpQ\og{n/5) log(l/5')). Therefore, by 
carefully selecting constants in ()(•) we can make a < 1/4. 

For 7, 7 = 0(1) implies m = Llfpifi) log(t/5')) = LlfkpQ log(n/5) log(A:/5')). By carefully 
selecting constants in n(-) we can make 7 < 0.2. 
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For (3, = 0(1) implies m = 0{tfi{TJ)(3) = 0(A:/ro log(n/(5) log(l/(5')). By carefully 

selecting constants we can have /3 < 0.2. Finally, combining bounds on a, /3 and 7 we prove the 
desired result. □ 

Before proving Lemma we first cite a lemma from [DRVW06] that connects the volume of a 
simplex to the permutation sum of singular values. 

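Lemma 9 ([DRVW06]). Fix $A \in \mathbb{R}^{m \times n}$ with $m \le n$. Suppose $\sigma_1, \cdots, \sigma_m$ are the singular values of 
A. Then 

$$\sum_{S \subseteq [n], |S| = k} \mathrm{vol}(\Delta(S))^2 = \frac{1}{(k!)^2} \sum_{1 \le t_1 < t_2 < \cdots < t_k \le m} \sigma_{t_1}^2 \sigma_{t_2}^2 \cdots \sigma_{t_k}^2. \qquad (53)$$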


Now we are ready to prove Lemma 

Proof of Lemma^ Let denote the best rank-A; approximation of M and assume the singular 
values of M are Let C = {ii, • • • ,ik} be the selected columns. Let r G 11^, where 

Ilfc denotes all permutations with k elements. By LAr,* we denote the linear subspace spanned 
by and let d(Mb),denote the distance between column and 

subspace LAr.t- We then have 


PC < 


< 


< 



\M\f 




2.5^ 

2 .s'' 
2 . 5 '' 
2 . 5 '' 
2 . 5 '' 


Eren^ ||MWb))||2^(MWL)),7^^^^)2 ... 
Erm, (fc!)^vol(A(C))^ 
(A:!)3vo1(A(C))2 

eS4Fe&F'^eSI4 

(A;!)3vo1(A(C))2 

E 2 ^2 . . . ^2 


kWo\{A{C)f 

ET:|T|=fc Vol(A(r))2 


2 .b^k\pc. 


For the first inequality we apply Eq. (231 and for the second to last inequality we apply Lemma 

HI □ 


Lemmaj^can be proved by applying Theorem|^for T rounds, given the norm estimation accu¬ 
racy bound in Proposition 

Proof of Lemma^ First note that 

||M - Pwu5iU-U5t(M)|||’ < ||M - 7^wu5iU-U5T,fc(M)llF' 
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Applying Theoremwith = |, we have 


< 

< 

< 

< 


E[\\M-Vuus,u-uSrm\\l] 

||M - Mfclll + —E [||M - iPwu5iU...5r_i(M)||^]' 


5A: 

|M-Mfc||l. + — ( ||M-Mfc||| + 


5k 




^ '5 


2 St 
T 




StST-1 


+ 


+1? 


2st-i 

T-1 


E[||M-iPwu5iu...5^_.(M)|||] 

k'^~^ ^ 


sr-i•••Si 


|M-Mfc| 


k^ 


stst-1 • ••Si 


< 

< 


1 + ^) WM-Mkfp + ^WBfp 


\M-VuiM)fp 
l|M-Mfc||^ + ^||E||^ 


Finally applying Markov’s inequality we complete the proof. 


□ 


To prove the reconstruction error bound in Lemma we need the following two technical 
lemmas, cited from [KS13, BRN10]. 

Lemma 10 ( IIKS131 ). Suppose lA C has dimension k and U G is the orthogonal ma¬ 

trix associated with U. Let Ll C [n] be a subset of indices each sampled from i.i.d. Bernoulli 
distributions with probability mjni. Then for some vector y G M"', with probability at least 1 — 5: 


l|uSyolll</3-^ll«lll. 

ni ni 


(54) 


where /3 is defined in Theorem^ 


Lemma 11 ( IIBRNIOII I. With the same notation in Lemma and Theorem With probability 
>1 — 5 one has 

(55) 


||(uTun)-i< 


m 


(1 — 7)m’ 


provided that 7 < 1 . 


Now we can prove Lemma[^ 

Proof of Lemma Let U =U {S) and U G ^ ® be the orthogonal matrix associated with U. Fix 
a column i and let x = = a + r, where a ^ U and r G lA^. What we want is to bound 

\\x — U(UQUf 2 )~^UQa;n||| in terms of HrUl. 

Write a = Ua. By Lemma [m if m satisfies the condition given in the Lemma then with 
probability over 1 — 5 — 5" we know (U^Uq) is invertible and furthermore, ||(Uj! 2 Ufi )“^||2 < 
2nijm. Consequently, 


U(U^Un)-^Us^an = U(U^Un)-^U^Ufia = Ud = a. (56) 
That is, the subsampled projector preserves components of x in subspace U. 

Now let’s consider the noise term r. By Corollary with probability > 1 — (i we can bound 
the incoherence level of y as y{y) = 0{syQ \og{n/6)). The incoherence of subspace U can also be 
bounded as y,{U) = 0(/ro log(n/(5)). Subsequently, given m = n(e“^s^o log(n/(5) log(n/(5")) we 
have (with probability > 1 - <5 - 26") 

\\x-lJ{\jlVn)-^Vlia + r)\l 
= ||a + r-U(U^Uo)-'uT(a + r)||2 

= ||r-U(U;^Un)-'U^r||i 

< IWIi + ll(u?2Uo)-1il|uTr||i 

< {l + 0 {e))\\r\\l 


For the second to last inequality we use the fact that r G . By carefully selecting constants in 


Eq. (221 we can make 


- U(Uf^Un)-'UA*||^ < (1 + e)\\Vu±x\\i. 


(57) 


Summing over all n 2 columns yields the desired result. 


□ 


Appendix C Proof of lower bound for passive sampling 


Proof of Theorem^ Let X = {Xi, • • • , C be a finite subset of X' which we specify later. 
Let TT be any prior distribution over X. We then have the following chain of inequalities: 


$$R^*_{\mathrm{css}} = \inf_{f \in \mathcal{F}'} \inf_{q \in \mathcal{Q}} \sup_{X \in \mathcal{X}'} \Pr\left[X \ne X_C X_C^\dagger X\right]$$

$$\ge \inf_{f \in \mathcal{F}'} \inf_{q \in \mathcal{Q}} \Pr_{\Omega \sim q;\, X \sim \pi;\, f}\left[X \ne X_C X_C^\dagger X\right] \qquad (58)$$

$$\ge \inf_{f \in \mathcal{F}'} \min_{|\Omega| = m} \Pr_{X \sim \pi;\, f}\left[X \ne X_C X_C^\dagger X\right]. \qquad (59)$$

Here Eq. (58) uses the fact that the maximum dominates any expectation over the same set, and for 
Eq. (59) we apply Yao's principle, which asserts that the worst-case performance of a randomized 
algorithm is lower bounded by the average-case performance of the best deterministic algorithm. 
Hence, when the input matrix X is randomized by a prior π it suffices to consider only deterministic 
sampling schemes, which correspond to a subset Ω of matrix entries fixed a priori, with size |Ω| = m. 

We nexf consfrucf fhe subsef X andlef vr be fhe uniformdisfribufion over X. Lef xi, - ■ ■ ,Xk -2 G 
be an arbifrary sef of linear independenf column vecfors wifh [xi]i = 0 for alH = 1, 2, • • • ,k = 
2 and - ,/j,{xk- 2 ) = 1 + ■ This can be done by selling all nonzero enfries in 

xi, • • • , xi ^_2 to ±1. In addition, we define r/ := (1,1, • • • , 1) and Sj = (0, • • • , 0,1, 0, • • • , 0) 
wifh fhe only nonzero enlry al fhe ylh posifion. Nexf, define X = wifh 


{ Xu if ^ < k — 2 , 

y - 2 ej ifi = i, (60) 

y ofherwise. 
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It follows by definition that rank(X*’'^) = k and = 1 + for all i,j and £. 

Furthermore, for fixed i and j one necessary condifion for X = X^X^X is {1, 2, • • • , fe — 2, i} C 
C. Therefore, if for disfincf and some ji, j 2 , js, ji one has = ... = X^’-^'* fhen 

the best a column subset selection algorithm / could do is random guessing and hence Pr[X ^ 
Xc-X^X] >ll2. Consequently, for fixed Q one has 


inf Pr [X ^ XcX;Lxl > - - - 
feT' ^ o c J - 2 2 


{X*’^' : ^ X*^ Vz' i,f G [m]} 


(61) 


The final sfep is fo bound fhe size of fhe sef E = {X*’-^ : X^’-^ / X®’-^ , Vi' ^ i,j' G [ni]}. 
Note fhaf if X^ is +1 on all enfries (i, j) wifh i > k — 2 fhen X. ^ E because for every X' G X, 
Xq = Xq. Consequenfly, 


E < 


|F!| 


< 


m 


ni{n2-k + 2) ni{n2-k)' 


Plugging Eq. (62) into Eq. (61) we complete the proof of Theorem 8. 


(62) 

□ 


Appendix D Some concentration inequalities 

Lemma 12 ([LM00]). Let $X \sim \chi^2_d$. Then with probability $\ge 1 - 2\delta$ the following holds: 

$$-2\sqrt{d \log(1/\delta)} \le X - d \le 2\sqrt{d \log(1/\delta)} + 2\log(1/\delta). \qquad (63)$$
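
As a quick sanity check of Lemma 12, the sketch below simulates chi-square variables as sums of squared standard Gaussians and counts how often $X - d$ leaves the stated interval. The parameter values, the seed and the function names are arbitrary choices for illustration; the simulation is not part of the original analysis.

```python
import math
import random


def chi_square_deviation_interval(d, delta):
    """Interval for X - d from Lemma 12, with X ~ chi-square(d)."""
    root = math.sqrt(d * math.log(1.0 / delta))
    return -2.0 * root, 2.0 * root + 2.0 * math.log(1.0 / delta)


def chi_square_violation_rate(d, delta, trials, seed=0):
    """Empirical frequency with which X - d falls outside the interval."""
    rng = random.Random(seed)
    low, high = chi_square_deviation_interval(d, delta)
    violations = 0
    for _ in range(trials):
        x = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        if not (low <= x - d <= high):
            violations += 1
    return violations / trials


rate = chi_square_violation_rate(d=20, delta=0.05, trials=2000)
print(f"empirical violation rate: {rate:.4f} (lemma allows up to {2 * 0.05})")
```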


Lemma 13. Let $X_1, \cdots, X_n \sim \mathcal{N}(0, \sigma^2)$. Then with probability $\ge 1 - \delta$ the following holds: 

$$\max_i |X_i| \le \sigma\sqrt{2\log(2n/\delta)}. \qquad (64)$$
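
Lemma 13 can be checked empirically in the same spirit. The sketch below draws n independent Gaussians many times and measures how often the largest absolute value exceeds $\sigma\sqrt{2\log(2n/\delta)}$; independence of the draws, the chosen parameters and the function names are assumptions of this illustration.

```python
import math
import random


def gaussian_max_bound(n, sigma, delta):
    """Upper bound on max_i |X_i| from Lemma 13."""
    return sigma * math.sqrt(2.0 * math.log(2.0 * n / delta))


def gaussian_max_violation_rate(n, sigma, delta, trials, seed=0):
    """Empirical frequency with which max_i |X_i| exceeds the bound."""
    rng = random.Random(seed)
    bound = gaussian_max_bound(n, sigma, delta)
    violations = 0
    for _ in range(trials):
        largest = max(abs(rng.gauss(0.0, sigma)) for _ in range(n))
        if largest > bound:
            violations += 1
    return violations / trials


rate = gaussian_max_violation_rate(n=50, sigma=2.0, delta=0.1, trials=2000)
print(f"empirical violation rate: {rate:.4f} (lemma allows up to 0.1)")
```

Because the lemma follows from a union bound over the n variables, the observed violation rate is typically well below δ.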


Lemma 14 ([Ver10]). Let X be an n x t random matrix with i.i.d. standard Gaussian random 
entries. If $t < n$ then for every $\epsilon \ge 0$ with probability $\ge 1 - 2\exp(-\epsilon^2/2)$ the following holds: 

$$\sqrt{n} - \sqrt{t} - \epsilon \le \sigma_{\min}(X) \le \sigma_{\max}(X) \le \sqrt{n} + \sqrt{t} + \epsilon. \qquad (65)$$


Lemma 15 (Noncommutative Bernstein Inequality, [GLF+10, Rec11]). Let $X_1, \cdots, X_m$ be 
independent zero-mean square n x n random matrices. Suppose $\rho_k^2 = \max(\|\mathbb{E}[X_k X_k^\top]\|_2, \|\mathbb{E}[X_k^\top X_k]\|_2)$ 
and $\|X_k\|_2 \le M$ with probability 1 for all k. Then for any $t > 0$, 

$$\Pr\left[\left\|\sum_{k=1}^m X_k\right\|_2 > t\right] \le 2n \exp\left(-\frac{t^2/2}{\sum_{k=1}^m \rho_k^2 + Mt/3}\right). \qquad (66)$$

























References 


[AGR16] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov chain 
algorithms for sampling strongly Rayleigh distributions and determinantal point processes, 2016. 

[AKL13] Dimitris Achlioptas, Zohar Karnin, and Edo Liberty. Near-optimal entrywise sampling 
for data matrices. In NIPS, 2013. 

[AM07] Dimitris Achlioptas and Frank McSherry. Fast computation of low-rank matrix 
approximations. Journal of the ACM, 2007. 

[BDMI14] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal 
column-based matrix reconstruction. SIAM Journal on Computing, 43(2):687-717, 2014. 

[BJS15] Srinadh Bhojanapalli, Prateek Jain, and Sujay Sanghavi. Tighter low-rank 
approximation via sampling the leveraged element. In SODA, 2015. 

[BMD09] Christos Boutsidis, Michael Mahoney, and Petros Drineas. An improved approximation 
algorithm for the column subset selection problem. In SODA, 2009. 

[BNB10] Laura Balzano, Robert Nowak, and Waheed Bajwa. Column subset selection with 
missing data. In NIPS Workshop on Low-Rank Methods for Large-Scale Machine Learning, 2010. 

[BRN10] Laura Balzano, Benjamin Recht, and Robert Nowak. High-dimensional matched 
subspace detection when data are missing. In ISIT, 2010. 

[BW14] Christos Boutsidis and David P Woodruff. Optimal CUR matrix decompositions. In 
STOC, 2014. 

[BXM10] Jacob Bien, Ya Xu, and Michael Mahoney. CUR from a sparse optimization viewpoint. 
In NIPS, 2010. 

[CBSW13] Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Completing 
any low-rank matrix, provably. arXiv:1306.2979, 2013. 

[Cha87] Tony F Chan. Rank revealing QR factorizations. Linear Algebra and Its Applications, 
88:67-82, 1987. 

[CLM+15] Michael B Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, 
and Aaron Sidford. Uniform sampling for matrix approximation. In ITCS, 2015. 

[CP10] Emmanuel J Candes and Yaniv Plan. Matrix completion with noise. Proceedings of 
the IEEE, 98(6):925-936, 2010. 

[DKM06a] Petros Drineas, Ravi Kannan, and Michael Mahoney. Fast Monte Carlo algorithms 
for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 
36(1):132-157, 2006. 





[DKM06b] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast Monte Carlo algorithms 
for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on 
Computing, 36(1):158-183, 2006. 

[DKM06c] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast Monte Carlo algorithms 
for matrices III: Computing a compressed approximate matrix decomposition. SIAM 
Journal on Computing, 36(1):184-206, 2006. 

[DMIMW12] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. 
Fast approximation of matrix coherence and statistical leverage. The Journal of 
Machine Learning Research, 13(1):3475-3506, 2012. 

[DMM08] Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Relative-error CUR 
matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 
30(2):844-881, 2008. 

[DR10] Amit Deshpande and Luis Rademacher. Efficient volume sampling for row/column 
subset selection. In FOCS, 2010. 

[DRVW06] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix 
approximation and projective clustering via volume sampling. Theory of Computing, 
2:225-247, 2006. 

[DV06] Amit Deshpande and Santosh Vempala. Adaptive sampling and fast low-rank matrix 
approximation. In Approximation, Randomization, and Combinatorial Optimization. 
Algorithms and Techniques, pages 292-303. 2006. 

[FKV04] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for 
finding low-rank approximations. Journal of the ACM, 51(6):1025-1041, 2004. 

[GE96] Ming Gu and Stanley C Eisenstat. Efficient algorithms for computing a strong 
rank-revealing QR factorization. SIAM Journal on Scientific Computing, 17(4):848-869, 
1996. 

[GLF+10] David Gross, Yi-Kai Liu, Steven T Flammia, Stephen Becker, and Jens Eisert. 
Quantum state tomography via compressed sensing. Physical Review Letters, 
105(15):150401, 2010. 

[HCN11] Jarvis Haupt, Rui M Castro, and Robert Nowak. Distilled sensing: Adaptive sampling 
for sparse detection and estimation. IEEE Transactions on Information Theory, 
57(9):6222-6235, 2011. 

[iHc03] The international HapMap consortium. The international HapMap project. Nature, 
437:789-796, 2003. 

[JDMP11] Asif Javed, Petros Drineas, Michael Mahoney, and Peristera Paschou. Efficient 
genomewide selection of PCA-correlated tSNPs for genotype imputation. Annals 
of Human Genetics, 75(6):707-722, 2011. 





[KC03] Xiayi Ke and Lon Cardon. Efficient selective screening of haplotype tag SNPs. Bioinformatics, 19(2):287-288, 2003.

[KMO10] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980-2998, 2010.

[KS13] Akshay Krishnamurthy and Aarti Singh. Low-rank matrix and tensor completion via adaptive sampling. In NIPS, 2013.

[KS14] Akshay Krishnamurthy and Aarti Singh. On the power of adaptivity in matrix completion and approximation. arXiv:1407.3619, 2014.

[LM00] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302-1338, 2000.

[LPVDGT11] Karim Lounici, Massimiliano Pontil, Sara Van De Geer, and Alexandre B Tsybakov. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, pages 2164-2204, 2011.

[MRS+10] Sofia Mosci, Lorenzo Rosasco, Matteo Santoro, Alessandro Verri, and Silvia Villa. Solving structured sparsity regularization with proximal methods. In ECML/PKDD, 2010.

[PMJ+07] Peristera Paschou, Michael Mahoney, Asif Javed, Judith Kidd, Andrew Pakstis, Sheng Gu, Kenneth Kidd, and Petros Drineas. Intra- and interpopulation genotype reconstruction from tagging SNPs. Genome Research, 17(1):96-107, 2007.

[Rec11] Benjamin Recht. A simpler approach to matrix completion. The Journal of Machine Learning Research, 12:3413-3430, 2011.

[Tro12] Joel Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.

[Ver10] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2010.

[WZ13] Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. The Journal of Machine Learning Research, 14(1):2729-2769, 2013.

[XJZ15] Miao Xu, Rong Jin, and Zhi-Hua Zhou. CUR algorithm for partially observed matrices. In ICML, 2015.

[YL06] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, 2006.

[YZ14] Yi Yang and Hui Zou. A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing, pages 1-13, 2014.





